UNIT – I (DWDM)
1. What is data mining?
Data mining should have been more appropriately named "knowledge mining from data," or simply "knowledge mining." Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD.
Data mining can be performed on different kinds of data and databases:
3) Transactional databases:
A transactional database is a collection of records organized by time stamp, date, etc., where each record represents a transaction.
This type of database can roll back (undo) an operation when a transaction is not completed or committed (illustrated in the sketch below).
It is a highly flexible system in which users can modify information without affecting any sensitive information.
It follows the ACID properties of a DBMS.
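As an illustration of the rollback behavior described above, here is a minimal sketch using Python's built-in sqlite3 module; the table and column names are hypothetical, chosen only for this example.

```python
import sqlite3

# In-memory database used purely for illustration (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO account (id, balance) VALUES (1, 100.0), (2, 50.0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE account SET balance = balance - 70 WHERE id = 1")
        # Simulate a failure before the transfer completes.
        raise RuntimeError("transfer interrupted")
except RuntimeError:
    pass  # the partial update above has been rolled back

print(conn.execute("SELECT id, balance FROM account").fetchall())
# [(1, 100.0), (2, 50.0)]  -- balances unchanged, consistent with atomicity
```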
4) Advanced databases:
Object oriented
Object relational
Application oriented databases
• Spatial
• Temporal
• Time-Series
• Text
• Multimedia databases
Multimedia Databases:
Multimedia databases store audio, video, image, and text media.
Spatial Databases:
Spatial databases store geographical information.
Time-Series Databases:
Time-series databases contain data such as stock exchange prices and user-logged activities.
They handle arrays of numbers indexed by time, date, etc.
They often require real-time analysis.
Examples: eXtremeDB, Graphite, InfluxDB, etc.
Descriptive tasks derive patterns that summarize the underlying relationships in the data, e.g., correlations, trends, clusters, trajectories, and anomalies. These are explanatory in nature.
Predictive tasks perform inference on the current data to make predictions, i.e., they predict the value of a particular attribute based on the values of other attributes, e.g., classification and regression.
Data mining functionalities, and the kinds of patterns they can discover, are described below:
1. Class/Concept Description: Characterization and Discrimination
Summarized descriptions of a class or a concept are very useful. Such descriptions are called class/concept descriptions. They can be derived via (1) data characterization, (2) data discrimination, or (3) both data characterization and discrimination.
Data characterization is a summarization of the general characteristics or features of a target class of data. Methods used for this include statistical measures, plots, and OLAP operations.
The output of data characterization can be presented in various forms.
Ex: pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables.
Data discrimination is a comparison of the target class (the class under study) with one or a set of comparative classes (called the contrasting classes).
Ex: the user may wish to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by at least 30% during the same period.
The methods used and the output presentation are the same as for characterization, although discrimination descriptions should include comparative measures that help distinguish between the target and contrasting classes.
2. Association Analysis
Association analysis is the discovery of association rules showing attribute-value
conditions that occur frequently together in a given set of data. This analysis is widely used for
market basket or transaction data analysis.
Association rules are of the form X => Y, interpreted as "database tuples that satisfy the conditions in X are also likely to satisfy the conditions in Y."
Ex: The marketing manager of AllElectronics would like to determine which items are frequently purchased together within the same transactions. An example of such a rule, mined from the AllElectronics transactional database, is
buys(X, "computer") => buys(X, "software") [support = 1%; confidence = 50%]
where X is a variable representing a customer. A confidence of 50% means that if a customer buys a computer, there is a 50% chance that they will buy software as well. A support of 1% means that 1% of all the transactions under analysis show that computer and software are purchased together.
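A minimal sketch of how support and confidence for such a rule could be computed from a list of transactions; the transaction data below is made up purely for illustration.

```python
# Hypothetical transaction data: each transaction is the set of items bought together.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "scanner"},
    {"software", "printer"},
    {"computer", "software"},
]

antecedent, consequent = {"computer"}, {"software"}

n_total = len(transactions)
n_antecedent = sum(1 for t in transactions if antecedent <= t)
n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

support = n_both / n_total            # fraction of all transactions containing X and Y
confidence = n_both / n_antecedent    # fraction of X-transactions that also contain Y

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# support = 50%, confidence = 67%
```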
3. Classification and Prediction
Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts. The model is based on the analysis of a set of training data (i.e., data objects whose class label is known).
The derived model may be represented in various forms, such as classification (IF-
THEN) rules, decision trees, mathematical formulae, or neural networks
Fig: A classification model can be represented in various forms, such as (a) IF-THEN rules, (b) a decision tree, or (c) a neural network.
Ex: In AllElectronics, items are classified into three classes: good response, mild response, and no response, based on descriptive features of the items such as price, brand, place made, type, and category.
Predicting missing or unavailable numerical data values is referred to as prediction.
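A minimal sketch of the classification idea above, using scikit-learn's DecisionTreeClassifier on a tiny made-up item table; the feature encodings and class labels are hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data: [price, brand_code, type_code] per item.
X_train = [
    [1200, 0, 0],
    [300, 1, 1],
    [80, 2, 1],
    [950, 0, 0],
    [60, 2, 2],
]
y_train = ["good response", "mild response", "no response",
           "good response", "no response"]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# The learned model can be inspected as IF-THEN style rules (a decision tree).
print(export_text(model, feature_names=["price", "brand_code", "type_code"]))

# Classify a new, unseen item.
print(model.predict([[400, 1, 0]]))
```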
4. Evolution Analysis
It describes and models regularities or trends for objects whose behavior changes over time. This may include characterization, discrimination, association and correlation analysis, classification, prediction, and clustering.
Ex: Stock market data analysis to predict future trends using previous years' data, for decision making regarding stock investments.
5. Cluster Analysis
A cluster is a group of similar data points or objects. The objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters.
Ex: Cluster AllElectronics customer data with respect to customer locations in a city. These
clusters may represent individual target groups for marketing.
Fig: 2-D plot of customer data with respect to customer locations in a city, showing three data clusters. Each cluster "center" is marked with a "+".
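A minimal k-means sketch of the customer-location clustering described above, using scikit-learn on randomly generated 2-D coordinates; the data is synthetic, purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic (x, y) customer locations drawn around three city areas.
locations = np.vstack([
    rng.normal(loc=(2, 2), scale=0.5, size=(30, 2)),
    rng.normal(loc=(8, 3), scale=0.5, size=(30, 2)),
    rng.normal(loc=(5, 8), scale=0.5, size=(30, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(locations)
print("cluster centers (the '+' marks in the figure):")
print(kmeans.cluster_centers_)
print("cluster label of each customer:", kmeans.labels_[:10], "...")
```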
6. Outlier Analysis
A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers (noise in the data). Outliers may be detected using statistical tests.
Ex : Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of
extremely large amounts for a given account number in comparison to regular charges incurred
by the same account.
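A minimal sketch of the statistical idea behind this example: flag a purchase whose amount lies many standard deviations above an account's usual charges. The amounts below are made up.

```python
import statistics

# Hypothetical regular charges for one account, in dollars.
regular_charges = [42, 35, 60, 51, 48, 39, 55, 44]
new_charge = 4800

mean = statistics.mean(regular_charges)
stdev = statistics.stdev(regular_charges)

# Simple statistical test: flag a charge more than 3 standard deviations above the mean.
z = (new_charge - mean) / stdev
if z > 3:
    print(f"charge {new_charge} looks like an outlier (z = {z:.1f})")
```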
A statistical model is a set of mathematical functions that describe the behavior of the
objects in a target class in terms of random variables and their associated probability
distributions.
Statistical methods can also be used to verify data mining results. For example, after a
classification or prediction model is mined, the model should be verified by statistical
hypothesis testing. A statistical hypothesis test (sometimes called confirmatory data
analysis) makes statistical decisions using experimental data.
Machine learning investigates how computers can learn (or improve their performance)
based on data. A main research area is for computer programs to automatically learn to
recognize complex patterns and make intelligent decisions based on data.
Active learning is a machine learning approach that lets users play an active role in the
learning process. An active learning approach can ask a user (e.g., a domain expert) to
label an example, which may be from a set of unlabeled examples or synthesized by the
learning program.
Database systems research focuses on the creation, maintenance, and use of databases for
organizations and end-users. Particularly, database systems researchers have established
highly recognized principles in data models, query languages, query processing and
optimization methods, data storage, and indexing and accessing methods. Database systems
are often well known for their high scalability in processing very large, relatively structured
data sets.
Data warehouse integrates data originating from multiple sources and various timeframes. It
consolidates data in multidimensional space to form partially materialized data cubes. The
data cube model not only facilitates OLAP in multidimensional databases but also promotes
multidimensional data mining.
In information retrieval, queries are formed mainly by keywords, which do not have complex structures (unlike SQL queries in database systems).
Classification according to the kinds of data mined:
o Classifying according to the type of data handled, we may have a spatial, time-series, text, stream data, multimedia data mining system, or a World Wide Web mining system.
Classification according to the kinds of knowledge mined:
Data mining systems can be categorized according to the kinds of knowledge
they mine, that is, based on data mining functionalities, such as characterization,
discrimination, association and correlation analysis, classification, prediction, clustering,
outlier analysis, and evolution analysis.
Classification according to the applications adapted:
Data mining systems can also be categorized according to the applications they are adapted to. For example, data mining systems may be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so on.
Data mining is not an easy task, as the algorithms used can get very complex and data is not
always available at one place. It needs to be integrated from various heterogeneous data
sources. These factors also create some issues.
Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. Without such data cleaning methods, the accuracy of the discovered patterns will be poor.
Pattern evaluation − The patterns discovered should be interesting; a pattern may be uninteresting because it represents common knowledge or lacks novelty.
Performance Issues
There can be performance-related issues such as the following −
Parallel, distributed, and incremental mining algorithms − The huge size of many databases motivates the development of parallel and distributed algorithms: the data is divided into partitions that are processed in a parallel fashion, and the results from the partitions are then merged. Incremental algorithms update the mined knowledge without mining the entire data again from scratch.
An attribute is a data field, representing a characteristic or feature of a data object. The nouns attribute,
dimension, feature, and variable are often used interchangeably in the literature.
The term dimension is commonly used in data warehousing. Machine learning literature tends to use
the term feature, while statisticians prefer the term variable. Data mining and database professionals
commonly use the term attribute, and we do here as well. Attributes describing a customer object can
include, for example, customer ID, name, and address. Observed values for a given attribute are known
as observations.
A set of attributes used to describe a given object is called an attribute vector (or feature vector). The
distribution of data involving one attribute (or variable) is called univariate. A bivariate distribution
involves two attributes, and so on.
The type of an attribute is determined by the set of possible values it can take:
1. Nominal attribute
2. Binary attribute
3. Ordinal attribute
4. Numeric attribute
1. Nominal attribute:
Nominal means "relating to names." The values of a nominal attribute are symbols or names of things.
Each value represents some kind of category, code, or state, and so nominal attributes are also referred to as
categorical. The values do not have any meaningful order. In computer science, the values are also known as
enumerations.
Ex: Suppose that hair color and marital status are two attributes describing person objects. In our
application, possible values for hair color are black, brown, blond, red, auburn, gray, and white. The attribute
marital status can take on the values single, married, divorced, and widowed.
2. Binary attribute:
A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means
that the attribute is absent and 1 means that it is present. Binary attributes are referred to as Boolean if the
two states correspond to true and false.
Ex: The attribute medical test is binary, where a value of 1 means the result of the test for the patient is
positive, while 0 means the result is negative.
A binary attribute is symmetric if both of its states are equally valuable and carry the same weight. One
such example could be the attribute gender having the states male and female.
A binary attribute is asymmetric if the outcomes of the states are not equally important, such as the
positive and negative outcomes of a medical test for HIV. By convention, we code the most important
outcome, which is usually the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV negative).
3. Ordinal attribute:
An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among
them, but the magnitude between successive values is not known.
Ex: Suppose that drink size corresponds to the size of drinks available at a fast-food restaurant. This ordinal attribute has three possible values: small, medium, and large. The values have a meaningful sequence (which corresponds to increasing drink size); however, we cannot tell from the values how much bigger, say, a large is than a medium.
4. Numeric attribute:
A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values.
Numeric attributes can be interval-scaled or ratio-scaled.
a. Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-scaled
attributes have order and can be positive, 0, or negative.
Ex: A temperature attribute is interval-scaled. Suppose that we have the outdoor temperature value for a
number of different days, where each day is an object.
b. Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is, if a measurement is
ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. In addition, the
values are ordered, and we can also compute the difference between values, as well as the mean,
median, and mode.
Ex: Ratio-scaled attributes include count attributes such as years of experience (e.g., the
objects are employees) and number of words (e.g., the objects are documents).
Discrete versus Continuous Attributes:
A discrete attribute has a finite or countably infinite set of values, which may or may not be
represented as integers. The attributes hair color, smoker, medical test, and drink size each have a finite
number of values, and so are discrete. Note that discrete attributes may have numeric values, such as 0
and 1 for binary attributes, or the values 0 to 110 for the attribute age.
If an attribute is not discrete, it is continuous. The terms numeric attribute and continuous attribute are
often used interchangeably in the literature. In practice, real values are represented using a finite number
of digits. Continuous attributes are typically represented as floating-point variables.
8. Basic Statistical Descriptions of Data
For data preprocessing to be successful, it is essential to have an overall picture of your data. Basic
statistical descriptions can be used to identify properties of the data and highlight which data values should be
treated as noise or outliers.
There are 3 areas of basic statistical descriptions:
1. Measures of central tendency - which measure the location of the middle or center of a data
distribution.
2. Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and
Interquartile Range.
3. Graphic Displays of Basic Statistical Descriptions of Data.
Ex: Suppose we have the following values for salary (in thousands of dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
The (arithmetic) mean of these values is
x̄ = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110) / 12 = 696 / 12 = 58,
that is, $58,000.
Sometimes, each value xi in a set may be associated with a weight wi for i = 1,...,N. The weights reflect the significance, importance, or occurrence frequency attached to their respective values. In this case, we can compute the weighted arithmetic mean
x̄ = (w1 x1 + w2 x2 + ... + wN xN) / (w1 + w2 + ... + wN).
For skewed (asymmetric) data, a better measure of the center of data is the median, which is the middle
value in a set of ordered data values. It is the value that separates the higher half of a data set from the lower
half.
Suppose that a given data set of N values for an attribute X is sorted in increasing order. If N is odd, then
the median is the middle value of the ordered set. If N is even, then the median is not unique; it is the two
middlemost values and any value in between. If X is a numeric attribute in this case, by convention, the median
is taken as the average of the two middlemost values.
Median. Let’s find the median of 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110. The data are already sorted in
increasing order. There is an even number of observations (i.e., 12); therefore, the median is not unique. It can
be any value within the two middlemost values of 52 and 56 (that is, within the sixth and seventh values in the
list). By convention, we assign the average of the two middlemost values as the median; that is, (52+56)/2 = 108
/2 = 54. Thus, the median is $54,000. Suppose that we had only the first 11 values in the list. Given an odd
number of values, the median is the middlemost value. This is the sixth value in this list, which has a value of
$52,000.
The mode for a set of data is the value that occurs most frequently in the set. Therefore, it can be
determined for qualitative and quantitative attributes. It is possible for the greatest frequency to correspond to
several different values, which results in more than one mode. Data sets with one, two, or three modes are
respectively called unimodal, bimodal, and trimodal. In general, a data set with two or more modes is
multimodal. At the other extreme, if each data value occurs only once, then there is no mode.
Mode. The data 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 are bimodal. The two modes are $52,000 and
$70,000
The midrange can also be used to assess the central tendency of a numeric data set. It is the average of the
largest and smallest values in the set.
Midrange. The midrange of the data 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 is
(30,000+110,000)/2 = $70,000
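A short sketch verifying the central-tendency figures above with Python's statistics module; the midrange is computed by hand since the module has no built-in for it.

```python
import statistics

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in thousands

print("mean    :", statistics.mean(salaries))         # 58
print("median  :", statistics.median(salaries))       # 54.0 (average of 52 and 56)
print("modes   :", statistics.multimode(salaries))    # [52, 70]  (bimodal)
print("midrange:", (min(salaries) + max(salaries)) / 2)  # 70.0
```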
2. Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile
Range.
Range : Let x1,x2,...,xN be a set of observations for some numeric attribute, X. The range of the set is the
difference between the largest (max()) and smallest (min()) values.
Quartiles: Suppose that the data for attribute X are sorted in increasing numeric order. Imagine that we can pick certain data points so as to split the data distribution into equal-size consecutive sets.
Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets.
The 2-quantile is the data point dividing the lower and upper halves of the data distribution. It corresponds
to the median.
The 100-quantiles are more commonly referred to as percentiles. They divide the data distribution into 100
equal-sized consecutive sets.
The quartiles give an indication of a distribution’s center, spread, and shape. The first quartile, denoted
by Q1, is the 25th percentile. It cuts off the lowest 25% of the data. The third quartile, denoted by Q3, is the
75th percentile—it cuts off the lowest 75% (or highest 25%) of the data. The second quartile is the 50th
percentile. As the median, it gives the center of the data distribution.
The distance between the first and third quartiles is a simple measure of spread that gives the range
covered by the middle half of the data. This distance is called the interquartile range (IQR) and is defined
as IQR = Q3 − Q1.
Ex: Interquartile range. The quartiles are the three values that split the sorted data set into four equal parts. The data 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 contain 12 observations, already sorted in increasing order. Thus, the quartiles for these data are the third, sixth, and ninth values, respectively, in the sorted list. Therefore, Q1 = $47,000 and Q3 = $63,000, and the interquartile range is IQR = 63 − 47 = $16,000. (Note that the sixth value, $52,000, is one of the two middlemost values; because the number of data values is even, the median was taken earlier as the average of the two middlemost values, $54,000.)
The variance of N observations x1, x2, ..., xN for a numeric attribute X is
σ² = (1/N) Σ (xi − x̄)² = ((1/N) Σ xi²) − x̄²
where x̄ is the mean value of the observations. The standard deviation, σ, of the observations is the square root of the variance, σ².
Ex: Variance and standard deviation. For the data 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110, we found mean = x̄ = $58,000. To determine the variance and standard deviation, we set N = 12 and obtain
σ² = (1/12)(30² + 36² + ... + 110²) − 58² ≈ 379.17
σ ≈ √379.17 ≈ 19.47
σ measures spread about the mean and should be considered only when the mean is chosen as the measure
of center.
σ = 0 only when there is no spread, that is, when all observations have the same value. Otherwise, σ > 0.
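A short sketch checking the dispersion measures above with numpy. Note that np.percentile interpolates between values, so its quartiles may differ slightly from the third/ninth-value convention used in the example; the variance below divides by N, matching the formula above.

```python
import numpy as np

salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

q1, q3 = np.percentile(salaries, [25, 75])
print("range   :", salaries.max() - salaries.min())   # 80
print("Q1, Q3, IQR:", q1, q3, q3 - q1)                # interpolated quartiles
print("variance:", salaries.var())                    # ~379.17 (divides by N)
print("std dev :", salaries.std())                    # ~19.47
```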
A quantile plot displays all of the data for a given attribute, plotting each observation xi against its quantile fi ≈ (i − 0.5)/N, so the user can assess both the overall behavior and unusual occurrences of the data.
Example: Quantile plot of the unit price data for a branch (figure and table not reproduced here).
A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the
corresponding quantiles of another. It is a powerful visualization tool in that it allows the user to view whether
there is a shift in going from one distribution to another.
Suppose that we have two sets of observations for the attribute or variable unit price, taken from two
different branch locations. Let x1,...,xN be the data from the first branch, and y1,..., yM be the data from the
second, where each data set is sorted in increasing order.
If M = N (i.e., the number of points in each set is the same), then we simply plot yi against xi , where yi and
xi are both (i − 0.5)/N quantiles of their respective data sets.
If M < N (i.e., the second branch has fewer observations than the first), there can be only M points on the q-
q plot. Here, yi is the (i − 0.5)/M quantile of the y data, which is plotted against the (i − 0.5)/M quantile of the x
data. This computation typically involves interpolation.
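A minimal sketch of building a q-q plot for two branches with numpy and matplotlib; the unit price arrays below are made up, and interpolation handles the M < N case as described above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical unit prices from two branches, sorted in increasing order.
branch1 = np.sort(np.array([40, 45, 47, 52, 55, 58, 63, 70, 74, 80, 92, 100]))  # N = 12
branch2 = np.sort(np.array([42, 48, 55, 60, 64, 71, 78, 90]))                   # M = 8

M = len(branch2)
q = (np.arange(1, M + 1) - 0.5) / M          # (i - 0.5)/M quantile levels
# Interpolate branch1 at the same quantile levels so both axes use M points.
x = np.quantile(branch1, q)
y = branch2

plt.plot(x, y, "o")
plt.plot([x.min(), x.max()], [x.min(), x.max()], "--")  # reference line y = x
plt.xlabel("branch 1 unit price quantiles")
plt.ylabel("branch 2 unit price quantiles")
plt.title("q-q plot (sketch)")
plt.show()
```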
Histograms (or frequency histograms) are at least a century old and are widely used. "Histos" means pole or mast, and "gram" means chart, so a histogram is a chart of poles. Plotting histograms is a graphical method for
summarizing the distribution of a given attribute, X. If X is nominal, such as automobile model or item type,
then a pole or vertical bar is drawn for each known value of X. The height of the bar indicates the frequency
(i.e., count) of that X value. The resulting graph is more commonly known as a bar chart.
If X is numeric, the term histogram is preferred. The range of values for X is partitioned into disjoint consecutive subranges. The subranges, referred to as buckets or bins, are disjoint subsets of the data distribution for X. The range of a bucket is known as the width.
For example, a price attribute with a value range of $1 to $200 (rounded up to the nearest dollar) can be
partitioned into subranges 1 to 20, 21 to 40, 41 to 60, and so on. For each subrange, a bar is drawn with a height
that represents the total count of items observed within the subrange.
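A small sketch of the price histogram described above, using matplotlib with $20-wide buckets; the price values are randomly generated for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
prices = rng.integers(1, 201, size=300)      # hypothetical prices from $1 to $200

bins = np.arange(0, 201, 20)                 # buckets 1-20, 21-40, ..., each $20 wide
plt.hist(prices, bins=bins, edgecolor="black")
plt.xlabel("price ($)")
plt.ylabel("count of items")
plt.title("histogram of item prices (sketch)")
plt.show()
```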
A scatter plot is one of the most effective graphical methods for determining if there appears to be a
relationship, pattern, or trend between two numeric attributes. To construct a scatter plot, each pair of
values is treated as a pair of coordinates in an algebraic sense and plotted as points in the plane.
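A minimal scatter-plot sketch with matplotlib, plotting two hypothetical, loosely correlated numeric attributes against each other; the attribute names and values are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
advertising = rng.uniform(0, 100, size=50)             # hypothetical spend
sales = 3 * advertising + rng.normal(0, 20, size=50)   # loosely correlated attribute

plt.scatter(advertising, sales)
plt.xlabel("advertising spend")
plt.ylabel("sales")
plt.title("scatter plot (sketch)")
plt.show()
```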
9. Data Visualization: Data visualization aims to communicate data clearly and effectively
through graphical representation.
Data visualization approaches,
Pixel-oriented techniques
Geometric projection techniques
Icon-based techniques
Hierarchical and graph-based techniques.
Pixel-oriented techniques: For a data set of m dimensions, pixel-oriented techniques create m
windows on the screen, one for each dimension. The m dimension values of a record are
mapped to m pixels at the corresponding positions in the windows. The colors of the pixels
reflect the corresponding values.
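A rough sketch of the pixel-oriented idea: one small window (subplot) per dimension, with each record's value in that dimension shown as a colored pixel. The data is random and the layout is a simplification, purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
records = rng.random((400, 4))               # 400 records, m = 4 dimensions

# Sort records by the first dimension so related values line up across windows.
records = records[records[:, 0].argsort()]

fig, axes = plt.subplots(1, 4, figsize=(10, 3))
for d, ax in enumerate(axes):
    # Lay the 400 values of dimension d out as a 20 x 20 block of pixels.
    ax.imshow(records[:, d].reshape(20, 20), cmap="viridis")
    ax.set_title(f"dimension {d + 1}")
    ax.axis("off")
plt.show()
```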
10. Data Preprocessing: Data preprocessing is an important step in the data mining process.
It refers to the cleaning, transforming, and integrating of data in order to make it ready for analysis.
The goal of data preprocessing is to improve the quality of the data and to make it more suitable for
the specific data mining task.
1) Data Cleaning
2) Data Integration
3) Data Reduction
4) Data Transformation
1) Data Cleaning: Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Various methods for handling these problems are described below:
Handling Missing Values: The various methods for handling the problem of missing values in
data tuples include:
(a) Ignoring the tuple: This is usually done when the class label is missing (assuming the mining
task involves classification or description). This method is not very effective unless the tuple
contains several attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably.
(b) Manually filling in the missing value: In general, this approach is time-consuming and may
not be a reasonable task for large data sets with many missing values, especially when the value
to be filled in is not easily determined.
(c) Using a global constant to fill in the missing value: Replace all missing attribute values by the
same constant, such as a label like “Unknown,” or -∞. If missing values are replaced by, say,
“Unknown,” then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common — that of “Unknown.” Hence, although this
method is simple, it is not recommended.
(d) Using the attribute mean for quantitative (numeric) values or attribute mode for categorical
(nominal) values, for all samples belonging to the same class as the given tuple: For example, if
classifying customers according to credit risk, replace the missing value with the average income
value for customers in the same credit risk category as that of the given tuple.
(e) Using the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using Bayesian formalism, or decision tree induction. For
example, using the other customer attributes in your data set, you may construct a decision tree
to predict the missing values for income.
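A brief pandas sketch of methods (c) and (d): filling missing income values with a constant versus with the class-wise mean. The toy DataFrame and column names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [52000, np.nan, 31000, np.nan, 48000],
})

# (c) Fill with a global constant.
filled_constant = df["income"].fillna(-1)
print(filled_constant.tolist())    # [52000.0, -1.0, 31000.0, -1.0, 48000.0]

# (d) Fill with the mean income of the same credit-risk class.
filled_class_mean = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(filled_class_mean.tolist())  # [52000.0, 50000.0, 31000.0, 31000.0, 48000.0]
```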
Handling Noisy Data: Noise is a random error or variance in a measured variable. Data smoothing techniques are used to remove such noise.
1. Binning: Binning methods smooth a sorted data value by consulting its neighborhood, that is, the values around it. The sorted values are distributed into a number of buckets, or bins.
a. Smoothing by bin means: each value in the bin is replaced by the mean value of the bin.
b. Smoothing by bin medians: each value in the bin is replaced by the bin median.
c. Smoothing by bin boundaries: the minimum and maximum values of a bin are identified as the bin boundaries, and each bin value is replaced by the closest boundary value.
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For
example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in
this bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in
which each bin value is replaced by the bin median. In smoothing by bin boundaries, the
minimum and maximum values in a given bin are identified as the bin boundaries. Each bin
value is then replaced by the closest boundary value.
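A small sketch of equal-frequency binning and smoothing by bin means, assuming the sorted price list that the "4, 8, and 15 in Bin 1" example above comes from, with three values per bin.

```python
# Sorted data for price (in dollars), split into equal-frequency bins of 3 values.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# Smoothing by bin means: every value in a bin becomes the bin's mean.
smoothed = [[sum(b) / len(b)] * len(b) for b in bins]
print(bins)      # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smoothed)  # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
```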
2. Clustering: Outliers in the data may be detected by clustering, where similar values are
organized into groups, or ‘clusters’. Values that fall outside of the set of clusters may be
considered outliers.
2) Data integration: Data integration in data mining refers to the process of combining data from
multiple sources into a single, unified view. This can involve cleaning and transforming the data, as
well as resolving any inconsistencies or conflicts that may exist between the different sources. The
goal of data integration is to make the data more useful and meaningful for the purposes of analysis
and decision making. There are two major approaches to data integration: the "tight coupling" approach and the "loose coupling" approach.
Tight Coupling: This approach involves creating a centralized repository or data warehouse to store
the integrated data. The data is extracted from various sources, transformed and loaded into a data
warehouse. Data is integrated in a tightly coupled manner, meaning that the data is integrated at a high
level, such as at the level of the entire dataset or schema. This approach is also known as data
warehousing, and it enables data consistency and integrity, but it can be inflexible and difficult to
change or update.
Loose Coupling: This approach involves integrating data at the lowest level, such as at the level of
individual data elements or records. Data is integrated in a loosely coupled manner, meaning that the
data is integrated at a low level, and it allows data to be integrated without having to create a central
repository or data warehouse. This approach is also known as data federation, and it enables data
flexibility and easy updates, but it can be difficult to maintain consistency and integrity across multiple
data sources.
Here, an interface is provided that takes the query from the user, transforms it in a way the
source database can understand, and then sends the query directly to the source databases to
obtain the result.
There are three issues to consider during data integration: Schema Integration, Redundancy
Detection, and resolution of data value conflicts. These are explained in brief below.
1. Schema Integration:
Entities and attributes from multiple data sources must be matched up; this is known as the entity identification problem (e.g., determining whether customer_id in one database and cust_number in another refer to the same attribute).
2. Redundancy Detection:
An attribute may be redundant if it can be derived or obtained from another attribute or set of
attributes.
Inconsistencies in attributes can also cause redundancies in the resulting data set.
Some redundancies can be detected by correlation analysis.
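A small numpy sketch of redundancy detection via correlation analysis: a high correlation coefficient between two numeric attributes suggests that one may be derivable from the other. The attributes and data below are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
annual_income = rng.normal(50000, 10000, size=200)
# A (hypothetical) redundant attribute: monthly income is derivable from annual income.
monthly_income = annual_income / 12 + rng.normal(0, 50, size=200)

r = np.corrcoef(annual_income, monthly_income)[0, 1]
print(f"Pearson correlation = {r:.3f}")
if abs(r) > 0.9:
    print("attributes are highly correlated -> one is likely redundant")
```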
3) Data reduction: Data reduction is a technique used in data mining to reduce the size of a dataset
while still preserving the most important information. This can be beneficial in situations where the
dataset is too large to be processed efficiently, or where the dataset contains a large amount of
irrelevant or redundant information.
There are several different data reduction techniques that can be used in data mining, including:
1. Data Sampling: This technique involves selecting a subset of the data to work with, rather
than using the entire dataset. This can be useful for reducing the size of a dataset while still
preserving the overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features in the
dataset, either by removing features that are not relevant or by combining multiple features into
a single feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete data by
partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the dataset that
are most relevant to the task at hand.
It is important to note that data reduction involves a trade-off between the accuracy and the size of the data: the more the data is reduced, the less accurate the model may be and the less generalizable it may be.
In conclusion, data reduction is an important step in data mining, as it can help to improve the
efficiency and performance of machine learning algorithms by reducing the size of the dataset.
However, it is important to be aware of the trade-off between the size and accuracy of the data, and
carefully assess the risks and benefits before implementing it.
1. Data Cube Aggregation: This technique aggregates data into a simpler form. For example, imagine that the information gathered for your analysis of the years 2012 to 2014 includes your company's revenue every three months. If you are interested in annual sales rather than quarterly figures, the data can be summarized so that the resulting data set reports the total sales per year instead of per quarter.
2. Dimensionality reduction: Whenever we come across attributes that are only weakly relevant to our analysis, we keep just the attributes that are required. This reduces data size by eliminating outdated or redundant features.
Step-wise Forward Selection: The selection begins with an empty set of attributes; at each step, the best of the remaining original attributes is added to the set, based on its relevance (measured, for example, by a statistical significance test such as a p-value). A minimal sketch of greedy forward selection is given at the end of this subsection.
Suppose the data set contains attributes X1, X2, ..., Xn, of which a few are redundant. Forward selection might proceed as:
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
Step-wise Backward Selection: This selection starts with the complete set of attributes in the original data and, at each step, eliminates the worst attribute remaining in the set.
Combination of Forward and Backward Selection: This combines the two approaches, selecting the best attribute and removing the worst at each step, which saves time and makes the process faster.
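As referenced above, here is a minimal sketch of greedy forward selection using only numpy. The data is synthetic and deliberately constructed so that the selection order mirrors the X1, X2, X5 steps shown earlier; the R-squared scoring rule is an illustrative assumption, not a prescribed method.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
X = rng.normal(size=(n, 5))                                       # candidate attributes X1..X5
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 4] + rng.normal(0, 0.1, n)   # target depends on X1, X2, X5

def r2(cols):
    """R^2 of a least-squares fit of y on the chosen columns (plus an intercept)."""
    A = np.column_stack([X[:, cols], np.ones(n)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid.var() / y.var()

selected, remaining = [], set(range(5))
for _ in range(3):                                # greedily add the 3 most useful attributes
    best = max(remaining, key=lambda j: r2(selected + [j]))
    selected.append(best)
    remaining.remove(best)
    print("selected so far:", [f"X{j + 1}" for j in selected])
# selected so far: ['X1']
# selected so far: ['X1', 'X2']
# selected so far: ['X1', 'X2', 'X5']
```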
3. Data Compression: The data compression technique reduces the size of the files using different
encoding mechanisms (Huffman Encoding & run-length Encoding). We can divide it into two types
based on their compression techniques.
Lossless Compression: Encoding techniques (Run Length Encoding) allow a simple and
minimal data size reduction. Lossless data compression uses algorithms to restore the precise
original data from the compressed data.
Lossy Compression: Methods such as the Discrete Wavelet transform technique, PCA
(principal component analysis) are examples of this compression. For e.g., the JPEG image
format is a lossy compression, but we can find the meaning equivalent to the original image. In
lossy-data compression, the decompressed data may differ from the original data but are useful
enough to retrieve information from them.
4. Numerosity Reduction: In this reduction technique, the actual data is replaced with a mathematical model or a smaller representation of the data, so that only the model parameters need to be stored (parametric methods), or with non-parametric representations such as clustering, histograms, and sampling.
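A short sketch of two of the non-parametric numerosity-reduction ideas named above, sampling and histograms, on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.exponential(scale=50, size=10_000)          # 10,000 original values

# Sampling: keep a simple random sample of 1% of the data.
sample = rng.choice(data, size=100, replace=False)

# Histogram: store only bin edges and counts instead of the raw values.
counts, edges = np.histogram(data, bins=20)

print("original values :", data.size)
print("sample values   :", sample.size)
print("histogram stores:", counts.size, "counts +", edges.size, "bin edges")
```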
4) Data Transformation: Data transformation in data mining refers to the process of converting
raw data into a format that is suitable for analysis and modeling. The goal of data transformation is to
prepare the data for data mining so that it can be used to extract useful insights and knowledge. Data
transformation typically involves several steps, including:
1. Smoothing: Smoothing is a process used to remove noise from the dataset using algorithms. It highlights the important features present in the dataset and helps in predicting patterns. Collected data can be manipulated to eliminate or reduce variance and other forms of noise. The idea behind data smoothing is to identify simple changes that help predict trends and patterns; this helps analysts or traders who need to look at a lot of data, which can otherwise be difficult to digest, to find patterns they would not see otherwise.
2. Aggregation: Data collection or aggregation is the method of storing and presenting data in a
summary format. The data may be obtained from multiple data sources to integrate these data sources
into a data analysis description. This is a crucial step since the accuracy of data analysis insights is
highly dependent on the quantity and quality of the data used. Gathering accurate data of high quality
and a large enough quantity is necessary to produce relevant results. The collection of data is useful for
everything from decisions concerning financing or business strategy of the product, pricing,
operations, and marketing strategies. For example, sales data may be aggregated to compute monthly and annual total amounts.
3. Discretization: This is the process of transforming continuous data into a set of small intervals. Many data mining activities in the real world involve continuous attributes, yet many of the existing data mining frameworks are unable to handle them. Even when a data mining task can manage a continuous attribute, efficiency can be significantly improved by replacing the continuous attribute with its discrete values. For example, numeric age values (1-10, 11-20, ...) can be mapped to categories such as young, middle-aged, and senior.
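A brief pandas sketch of the age-discretization example, mapping numeric ages into labeled intervals; the bin edges and labels are illustrative assumptions.

```python
import pandas as pd

ages = pd.Series([4, 17, 25, 33, 48, 61, 70])

# Partition the continuous age values into labeled intervals (bins).
age_groups = pd.cut(ages, bins=[0, 20, 40, 60, 120],
                    labels=["young", "adult", "middle-aged", "senior"])
print(age_groups.tolist())
# ['young', 'young', 'adult', 'adult', 'middle-aged', 'senior', 'senior']
```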
4. Attribute Construction: Where new attributes are created & applied to assist the mining process
from the given set of attributes. This simplifies the original data & makes the mining more efficient.
5. Generalization: It converts low-level data attributes to high-level data attributes using concept
hierarchy. For Example Age initially in Numerical form (22, 25) is converted into categorical value
(young, old). For example, Categorical attributes, such as house addresses, may be generalized to
higher-level definitions, such as town or country.
6. Normalization: Data normalization involves converting all data variables into a given range.
Techniques that are used for normalization are:
Min-Max Normalization:
o This transforms the original data linearly into a new range [new_min_A, new_max_A].
o Suppose that min_A is the minimum and max_A is the maximum value of an attribute A.
o A value v of A is normalized to v' (the new value in the new range) by computing
v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
Z-Score Normalization:
o In z-score normalization (or zero-mean normalization), the values of an attribute A are normalized based on the mean (Ā) and standard deviation (σ_A) of A.
o A value v of attribute A is normalized to v' by computing
v' = (v − Ā) / σ_A
Decimal Scaling:
o It normalizes the values of an attribute by moving the position of their decimal points.
o The number of decimal places moved is determined by the maximum absolute value of attribute A.
o A value v of attribute A is normalized to v' by computing
v' = v / 10^j
o where j is the smallest integer such that max(|v'|) < 1.
o Suppose the values of an attribute A vary from −99 to 99.
o The maximum absolute value of A is 99.
o To normalize, we divide the values by 100 (i.e., j = 2, the number of digits in the largest absolute value), so the values come out as 0.98, 0.97, and so on.
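A short numpy sketch applying all three normalization techniques to a small made-up attribute; the values and the target range [0, 1] are illustrative assumptions.

```python
import numpy as np

v = np.array([200, 300, 400, 600, 1000], dtype=float)  # hypothetical attribute values

# Min-max normalization into the new range [0, 1].
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization (zero mean, unit standard deviation).
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, with j the smallest integer making max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal = v / 10 ** j

print(minmax)   # [0.    0.125 0.25  0.5   1.   ]
print(zscore)
print(decimal)  # [0.02 0.03 0.04 0.06 0.1 ]
```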