Unit - III
Data Preprocessing Techniques in Data Mining
Introduction
Data processing is the task of collecting raw data and translating it into usable information. The raw data is collected, filtered, sorted, processed, analyzed, stored, and then presented in a readable format. It is usually performed step by step by a team of data scientists and data engineers in an organization.
Data processing can be carried out automatically or manually. Nowadays, most data is processed automatically with the help of computers, which is faster and gives more accurate results. The processed data can be delivered in different forms, such as graphics or audio, depending on the software and the data processing methods used.
The collected data is then translated into the desired form as per the requirements of the task at hand. The data may be acquired from Excel files, databases, and text files, as well as from unorganized data such as audio clips, images, GPRS, and video clips.
The most commonly used tools for data processing are Storm, Hadoop, HPCC, Statwing, Qubole, and CouchDB. The processing of data is a key step of the data mining process, and processing raw data is a complicated task. How data is processed largely depends on the following factors:
The volume of data that needs to be processed.
The complexity of data processing operations.
Capacity and inbuilt technology of respective computer systems.
Technical skills and Time constraints.
Data cleaning
Data cleaning helps us remove inaccurate, incomplete, and incorrect data from the dataset. Some techniques used in data cleaning are −
Handling missing values
This type of scenario occurs when some data is missing.
Standard values can be used to fill in the missing values manually, but only for a small dataset.
The attribute's mean and median values can be used to replace missing values in normally and non-normally distributed data, respectively.
Tuples can be ignored if the dataset is quite large and many values are missing within a tuple.
The most probable value, estimated with a regression or decision tree algorithm, can be used to fill in a missing value.
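A minimal sketch of these options in Python with pandas (the column names and values below are assumptions made up for illustration):

import pandas as pd
import numpy as np

# Hypothetical dataset with missing values in two numeric attributes.
df = pd.DataFrame({"Age":    [22, 25, np.nan, 30, np.nan, 28],
                   "Income": [30000, 32000, 31000, np.nan, 45000, 40000]})

# Replace missing values with the attribute mean (roughly normal data).
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Replace missing values with the attribute median (skewed, non-normal data).
df["Income"] = df["Income"].fillna(df["Income"].median())

# Alternatively, ignore (drop) tuples that still contain missing values.
df_clean = df.dropna()
print(df_clean)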
Noisy Data
Noisy data are data that cannot be interpreted by a machine and that contain meaningless or faulty values. Some ways to handle them are −
Binning − This method smooths noisy data. The sorted data is divided into equal-sized bins, and a smoothing method is applied within each bin: smoothing by bin means (each bin value is replaced by the bin's mean), smoothing by bin medians (each value is replaced by the bin's median), or smoothing by bin boundaries (the minimum and maximum values of the bin are taken as boundaries, and each value is replaced by the closest boundary value). A small sketch follows this list.
Regression − Regression functions are used to smooth the data. Regression can be linear (one independent variable) or multiple (several independent variables).
Clustering − Similar data values are grouped into clusters; values that fall outside the clusters can be identified as outliers.
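As referenced in the binning item above, here is a minimal smoothing sketch in Python with NumPy; the sorted values and the bin count are assumptions for illustration:

import numpy as np

# Hypothetical sorted, noisy attribute values (e.g., prices).
values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.array_split(values, 3)          # equal-depth (equal-frequency) bins

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_mean = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Smoothing by bin boundaries: each value is replaced by the closest bin boundary.
by_boundary = np.concatenate([
    np.where(np.abs(b - b.min()) <= np.abs(b - b.max()), b.min(), b.max())
    for b in bins
])

print(by_mean)
print(by_boundary)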
Data integration
The process of combining data from multiple sources (databases, spreadsheets, text files) into a single dataset. A single, consistent view of the data is created in this process. Major problems during data integration are schema integration (integrating schemas and metadata collected from various sources), entity identification (recognizing the same real-world entities across different databases), and detecting and resolving data value conflicts.
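A minimal illustration of schema integration and entity identification in Python with pandas; the source tables and column names are assumptions for illustration:

import pandas as pd

# Two hypothetical sources describing the same customers under different schemas.
crm     = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Asha", "Ravi", "Meena"]})
billing = pd.DataFrame({"customer_id": [1, 2, 4], "total_spent": [1200, 560, 900]})

# Schema integration: reconcile equivalent attributes (cust_id vs. customer_id).
billing = billing.rename(columns={"customer_id": "cust_id"})

# Entity identification: match records that refer to the same real-world entity
# and combine them into a single, consistent view.
integrated = crm.merge(billing, on="cust_id", how="outer")
print(integrated)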
Data transformation
In this step, the format or structure of the data is changed to make it suitable for the mining process. Methods for data transformation are −
Normalization − Scaling the data so that it falls within a specified, smaller range (for example, -1.0 to 1.0); a short sketch follows this list.
Discretization − Helps reduce the data size by dividing continuous data into intervals.
Attribute Selection − New attributes are derived from the given attributes to help the mining process.
Concept Hierarchy Generation − Attributes are generalized from a lower level to a higher level in the hierarchy.
Aggregation − A summary of the data is computed and stored; the quality of the result depends on the quality and quantity of the underlying data.
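As referenced in the Normalization item, a minimal min-max normalization sketch in Python; the income values are assumptions for illustration:

import numpy as np

def min_max_normalize(values, new_min=-1.0, new_max=1.0):
    # Scale the values linearly into [new_min, new_max], here the -1.0 to 1.0 range.
    values = np.asarray(values, dtype=float)
    old_min, old_max = values.min(), values.max()
    return (values - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

incomes = [12000, 35000, 54000, 98000]
print(min_max_normalize(incomes))   # smallest value maps to -1.0, largest to 1.0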
Data reduction
Data reduction increases storage efficiency and reduces the volume of stored data while producing almost the same analytical results. Analysis becomes harder when working with huge amounts of data, so reduction is used to avoid that. Techniques of data reduction are −
Data Compression
Data is compressed to make analysis more efficient. Lossless compression means there is no loss of information during compression; lossy compression removes unnecessary information during compression.
Numerosity Reduction
The volume of data is reduced by storing only a model of the data (or a reduced representation) instead of the whole data set, which provides a much smaller representation without losing the essential information.
Dimensionality reduction
In this technique, the number of attributes (random variables) is reduced so that the dimensionality of the data set becomes lower. Attributes are combined without losing their essential characteristics.
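A minimal dimensionality reduction sketch using principal component analysis; it assumes scikit-learn is available, and the synthetic five-attribute dataset is made up for illustration:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))                         # two underlying factors
X = np.hstack([base, base @ rng.normal(size=(2, 3))])    # five correlated attributes

# Combine the five attributes into two principal components that retain
# most of the variance (the essential characteristics of the data).
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_)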
Data is processed with modern technologies using data processing software and programs. The software executes a set of instructions to process the data and yield the output. This method is the most expensive, but it provides the fastest processing speeds with the highest reliability and accuracy of output.
Types of Data Processing
There are different types of data processing based on the source of the data and the steps taken by the processing unit to generate an output. There is no one-size-fits-all method that can be used for processing raw data.
1. Batch Processing: In this type of data processing, data is collected and processed in batches. It is used for large
amounts of data. For example, the payroll system.
2. Single User Programming Processing: It is usually done by a single person for personal use. This technique is suitable even for small offices.
3. Multiple Programming Processing: This technique allows simultaneously storing and executing more than
one program in the Central Processing Unit (CPU). Data is broken down into frames and processed using two
or more CPUs within a single computer system. It is also known as parallel processing. Further, the multiple
programming techniques increase the respective computer's overall working efficiency. A good example of
multiple programming processing is weather forecasting.
4. Real-time Processing: This technique allows the user to interact directly with the computer system and eases data processing. It is also known as the direct mode or interactive mode technique and is developed exclusively to perform one task. It is a form of online processing that always remains under execution. For example, withdrawing money from an ATM.
5. Online Processing: In this technique, data is entered and processed directly; it is not stored and accumulated first and processed later. The technique is designed to reduce data entry errors, as it validates data at various points and ensures that only correct data is entered. It is widely used for online applications. For example, barcode scanning.
6. Time-sharing Processing: This is another form of online data processing that allows several users to share the resources of an online computer system. This technique is adopted when results are needed swiftly. Moreover, as the name suggests, the system is time-based: it allocates slices of processing time to each user in turn.
7. Distributed Processing: This is a specialized data processing technique in which various computers (located
remotely) remain interconnected with a single host computer making a network of computers. All these
computer systems remain interconnected with a high-speed communication network. However, the central
computer system maintains the master database and monitors accordingly. This facilitates communication
between computers.
i. Parametric: A parametric numerosity reduction technique assumes a model for the data and stores only the model parameters instead of the actual data. Regression and log-linear models are examples; log-linear models study the probability of each tuple in a multidimensional space. Regression and log-linear methods can be used for sparse data and skewed data.
ii. Non-Parametric: A non-parametric numerosity reduction technique does not assume any model. The non-parametric techniques result in a more uniform reduction, irrespective of data size, but they may not achieve as high a volume of data reduction as the parametric techniques. Common non-parametric data reduction techniques are histograms, clustering, sampling, data cube aggregation, and data compression.
o Histogram: A histogram is a graph that represents a frequency distribution, describing how often each value appears in the data. A histogram uses the binning method to represent an attribute's data distribution: it partitions the values into disjoint subsets called bins or buckets. A histogram can represent dense, sparse, uniform, or skewed data. Instead of only one attribute, a histogram can also be built over multiple attributes; in practice it can effectively represent up to five attributes.
o Clustering: Clustering techniques group similar objects from the data so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. How similar the objects inside a cluster are can be measured with a distance function; the more similar two objects are, the closer they appear within the cluster. The quality of a cluster depends on its diameter, i.e., the maximum distance between any two objects in the cluster.
o The cluster representations replace the original data. This technique is more effective if the data can be partitioned into distinct clusters.
o Sampling: One of the methods used for data reduction is sampling, as it can reduce the large data set
into a much smaller data sample. Below we will discuss the different methods in which we can sample
a large data set D containing N tuples:
o Simple random sample without replacement (SRSWOR) of size s: Here, s tuples are drawn from the N tuples of data set D (s < N). The probability of drawing any tuple from data set D is 1/N, which means all tuples have an equal probability of being sampled.
o Simple random sample with replacement (SRSWR) of size s: It is similar to the SRSWOR, but
the tuple is drawn from data set D, is recorded, and then replaced into the data set D so that it
can be drawn again.
o Cluster sample: The tuples in data set D are grouped into M mutually disjoint subsets (clusters). Data reduction can then be applied by taking a simple random sample without replacement of s of these clusters, where s < M.
o Stratified sample: The large data set D is partitioned into mutually disjoint sets called 'strata'.
A simple random sample is taken from each stratum to get stratified data. This method is
effective for skewed data.
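A minimal sketch of these sampling schemes in Python with pandas; the data set D and the stratum column are assumptions for illustration:

import pandas as pd

# Hypothetical data set D with N = 100 tuples and a "dept" attribute for strata.
D = pd.DataFrame({"id": range(1, 101), "dept": ["CS", "EE", "ME", "CE"] * 25})
s = 10

srswor = D.sample(n=s, replace=False, random_state=1)   # SRSWOR of size s
srswr  = D.sample(n=s, replace=True,  random_state=1)   # SRSWR of size s

# Stratified sample: a simple random sample drawn from each stratum.
stratified = D.groupby("dept", group_keys=False).apply(
    lambda stratum: stratum.sample(frac=0.1, random_state=1))

print(len(srswor), len(srswr), len(stratified))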
3. Data Cube Aggregation
This technique is used to aggregate data in a simpler form. Data Cube Aggregation is a multidimensional aggregation
that uses aggregation at various levels of a data cube to represent the original data set, thus achieving data reduction.
For example, suppose you have the data of All Electronics sales per quarter for the year 2018 to the year 2022. If you
want to get the annual sale per year, you just have to aggregate the sales per quarter for each year. In this way, aggregation
provides you with the required data, which is much smaller in size, and thereby we achieve data reduction even without
losing any data.
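A minimal sketch of this roll-up in Python with pandas; the quarterly figures below are made-up stand-ins for the All Electronics data:

import pandas as pd

sales = pd.DataFrame({
    "year":    [2018] * 4 + [2019] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 2,
    "sales":   [224, 408, 350, 586, 310, 402, 390, 610],
})

# Aggregate the quarterly cells of the cube up to annual totals: the result is
# much smaller but still answers the annual-sales query exactly.
annual = sales.groupby("year", as_index=False)["sales"].sum()
print(annual)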
4. Data Compression
Data compression employs modification, encoding, or converting the structure of data in a way that consumes less
space. Data compression involves building a compact representation of information by removing redundancy and
representing data in a compact binary form. Compression from which the original data can be restored exactly is called lossless compression; in contrast, compression from which the original form cannot be fully recovered is called lossy compression. Dimensionality reduction and numerosity reduction methods can also be viewed as forms of data compression.
This technique reduces the size of files using different encoding mechanisms, such as Huffman Encoding and Run-Length Encoding. We can divide it into two types based on the compression technique used.
i. Lossless Compression: Encoding techniques such as Run-Length Encoding allow simple and modest reductions in data size. Lossless data compression uses algorithms that restore the precise original data from the compressed data; a minimal run-length encoding sketch follows this list.
ii. Lossy Compression: In lossy data compression, the decompressed data may differ from the original data but remain useful enough to retrieve information from them. For example, the JPEG image format uses lossy compression, yet we can still recover meaning equivalent to the original image. Methods such as the Discrete Wavelet Transform and PCA (principal component analysis) are examples of this kind of compression.
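As referenced in the lossless item above, a minimal run-length encoding sketch in Python (the input string is an assumption for illustration):

def run_length_encode(data):
    # Lossless compression: store each symbol once together with its repeat count.
    encoded = []
    for symbol in data:
        if encoded and encoded[-1][0] == symbol:
            encoded[-1] = (symbol, encoded[-1][1] + 1)
        else:
            encoded.append((symbol, 1))
    return encoded

def run_length_decode(encoded):
    # Decoding restores the exact original data, so no information is lost.
    return "".join(symbol * count for symbol, count in encoded)

original = "AAAABBBCCDAAA"
packed = run_length_encode(original)
assert run_length_decode(packed) == original   # lossless round trip
print(packed)                                  # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 3)]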
5. Discretization Operation
The data discretization technique is used to divide attributes of a continuous nature into intervals. Many constant values of the attributes are replaced with labels of small intervals, so that mining results can be presented in a concise and easily understandable way.
i. Top-down discretization: If you first choose one or a few points (so-called breakpoints or split points) to divide the whole range of attribute values, and then repeat this method on the resulting intervals until the end, the process is known as top-down discretization, also known as splitting.
ii. Bottom-up discretization: If you first consider all the constant values as split points and then discard some of them by merging neighbouring values into intervals, the process is called bottom-up discretization, also known as merging.
Discretization in data mining
Data discretization refers to a method of converting a huge number of data values into a smaller number of values so that the evaluation and management of the data become easier. In other words, data discretization is a method of converting the attribute values of continuous data into a finite set of intervals with minimum loss of information. There are two forms of data discretization: supervised discretization, in which the class information is used, and unsupervised discretization, which proceeds without class information and depends on the direction in which the operation proceeds, i.e., a top-down splitting strategy or a bottom-up merging strategy.
Now we can understand this concept with the help of an example.
Suppose we have an attribute Age with a set of given continuous values.
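Since the original list of values is not reproduced here, the following sketch assumes some example ages and shows the idea in Python with pandas:

import pandas as pd

# Assumed example values for the Age attribute (illustrative only).
ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70])

# Convert the continuous ages into a finite set of labelled intervals.
age_groups = pd.cut(ages, bins=[0, 17, 30, 50, 100],
                    labels=["child", "young", "mature", "senior"])
print(age_groups.value_counts())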
Another example is web analytics, where we gather statistical data about website visitors. For example, all visitors who visit the site from an IP address located in India are shown under the country level "India".
Some Famous techniques of data discretization
Histogram analysis
A histogram is a plot used to represent the underlying frequency distribution of a continuous data set. Histograms assist in inspecting the data distribution, for example revealing outliers, skewness, or how closely the data follow a normal distribution.
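A minimal histogram-analysis sketch in Python with NumPy; the skewed sample data are generated only for illustration:

import numpy as np

rng = np.random.default_rng(7)
# Hypothetical continuous attribute (e.g., purchase amounts) with a long right tail.
amounts = np.concatenate([rng.normal(50, 10, 500), rng.exponential(80, 50)])

# np.histogram partitions the value range into disjoint bins (buckets) and counts
# how often values fall into each one, exposing skewness and possible outliers.
counts, bin_edges = np.histogram(amounts, bins=10)
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"[{left:7.1f}, {right:7.1f}) : {count}")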
Binning
Binning refers to a data smoothing technique that groups a huge number of continuous values into a smaller number of bins. This technique can also be used for data discretization and for the development of concept hierarchies.
Cluster Analysis
Cluster analysis is a form of data discretization: a clustering algorithm can be applied to partition the values of a numeric attribute x into clusters, and each cluster then represents one discrete value (interval) of x.
Data discretization and concept hierarchy generation
The term hierarchy represents an organizational structure or mapping in which items are ranked according to their levels of importance. In other words, a concept hierarchy refers to a sequence of mappings from a set of low-level, specific concepts to higher-level, more general concepts; that is, mapping is done from low-level concepts to high-level concepts. For example, in computer science there are different types of hierarchical systems: a document placed in a folder, at a specific position in the Windows directory tree, is a good example of a hierarchical tree model. There are two types of hierarchy mapping: top-down mapping and bottom-up mapping.
Let's understand this concept hierarchy for the dimension location with the help of an example.
A particular city can map with the belonging country. For example, New Delhi can be mapped to India, and India
can be mapped to Asia.
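A minimal sketch of such a concept hierarchy for the location dimension in Python; the mapping tables are small assumptions for illustration:

# Low-level to high-level mappings for the location dimension.
city_to_country = {"New Delhi": "India", "Mumbai": "India", "Tokyo": "Japan"}
country_to_continent = {"India": "Asia", "Japan": "Asia"}

def roll_up(city):
    # Map a low-level concept (city) to higher-level concepts (country, continent).
    country = city_to_country[city]
    return country, country_to_continent[country]

print(roll_up("New Delhi"))   # ('India', 'Asia')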
Top-down mapping
Top-down mapping generally starts at the top with some general information and ends at the bottom with the specialized information.
Bottom-up mapping
Bottom-up mapping generally starts at the bottom with some specialized information and ends at the top with the generalized information.
the initial target class working relation and the initial contrasting class working relation. The attributes are then sorted (i.e., ranked) according to their computed relevance to the data mining task.
4. Generate the concept description using AOI − Perform AOI using a less conservative set of attribute generalization thresholds. If the descriptive mining task is a class characterization, only the initial target class working relation is included here. If the descriptive mining task is a class comparison, both the initial target class working relation and the initial contrasting class working relation are included.
Relevance Measure Components :
1. Information Gain (ID3)
2. Gain Ratio (C4.5)
3. Gini Index
4. Chi^2 contingency table statistics
5. Uncertainty Coefficient
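A minimal sketch of the first of these measures, information gain, in Python; the toy attribute and class labels are assumptions for illustration:

import numpy as np
from collections import Counter

def entropy(labels):
    # Shannon entropy (in bits) of a sequence of class labels.
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(values, labels):
    # Reduction in entropy obtained by splitting the labels on an attribute's values.
    values, labels = np.asarray(values), np.asarray(labels)
    remainder = sum(
        (values == v).sum() / len(labels) * entropy(labels[values == v])
        for v in np.unique(values)
    )
    return entropy(labels) - remainder

status = ["grad", "grad", "undergrad", "undergrad", "grad", "undergrad"]
target = ["yes",  "yes",  "no",        "no",        "yes",  "yes"]
print(round(information_gain(status, target), 3))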
1. Data Collection: The set of relevant data in the database and data warehouse is collected by query processing and partitioned into a target class and one or a set of contrasting classes.
2. Dimension relevance analysis: If there are many dimensions and analytical comparisons are desired, then
dimension relevance analysis should be performed. Only the highly relevant dimensions are included in the
further analysis.
3. Synchronous Generalization: The process of generalization is performed upon the target class to the level
controlled by the user or expert specified dimension threshold, which results in a prime target class relation or
cuboid. The concepts in the contrasting class or classes are generalized to the same level as those in the prime
target class relation or cuboid, forming the prime contrasting class relation or cuboid.
4. Presentation of the derived comparison: The resulting class comparison description can be visualized in the
form of tables, charts, and rules. This presentation usually includes a "contrasting" measure (such as count%)
that reflects the comparison between the target and contrasting classes. As desired, the user can adjust the
comparison description by applying drill-down, roll-up, and other OLAP operations to the target and contrasting
classes.
5. For example, suppose the task we want to perform is to compare graduate and undergraduate students using a discriminant rule. To do this, the DMQL query would be as follows.
use University_Database
mine comparison as "graduate_students vs_undergraduate_students"
in relevance to name, gender, program, birth_place, birth_date, residence, phone_no, GPA
for "graduate_students"
where status in "graduate"
versus "undergraduate_students"
where status in "undergraduate"
analyze count%
from student
Data Marts –
What is Data Mart?
A Data Mart is a subset of an organizational information store, generally oriented to a specific purpose or primary data subject, and it may be distributed to support business needs. Data marts are analytical record stores designed to focus on particular business functions for a specific community within an organization. Data marts are derived from subsets of data in a data warehouse, though in the bottom-up data warehouse design methodology the data warehouse is created from the union of the organizational data marts.
The fundamental use of a data mart is for Business Intelligence (BI) applications. BI is used to gather, store, access, and analyze records. A data mart can be used by smaller businesses to utilize the data they have accumulated, since it is less expensive than implementing a full data warehouse.
Independent Data Marts
The second approach is independent data marts (IDM). Here, independent data marts are created first, and then a data warehouse is designed using these multiple independent data marts. Because all the data marts are designed independently, integration of the data marts is required. This is also termed the bottom-up approach, as the data marts are integrated to develop the data warehouse.
Other than these two categories, one more type exists that is called "Hybrid Data Marts."
Hybrid Data Marts
It allows us to combine input from sources other than a data warehouse. This can be helpful in many situations, especially when ad hoc integrations are needed, such as after a new group or product is added to the organization.
Self-Contained: Data marts are self-contained, which means that they have their own set of tables, indexes, and data
models. This allows for easier management and maintenance of the data mart.
Security: Data marts can be secured, which means that access to the data in the data mart can be controlled and restricted
to specific users or groups.
Scalability: Data marts can be scaled horizontally or vertically to accommodate larger volumes of data or to support
more users.
Integration with Business Intelligence Tools: Data marts can be integrated with business intelligence tools, such as
Tableau, Power BI, or QlikView, which allows users to analyze and visualize the data stored in the data mart.
ETL Process: Data marts are typically populated using an Extract, Transform, Load (ETL) process, which means that
data is extracted from the larger data warehouse or data lake, transformed to meet the requirements of the data mart,
and loaded into the data mart.
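A minimal ETL sketch in Python; the SQLite databases, the "sales" warehouse table, and the mart table name are assumptions made up for illustration:

import sqlite3
import pandas as pd

warehouse = sqlite3.connect(":memory:")
mart = sqlite3.connect(":memory:")

# (Set-up only) a tiny stand-in for a table in the enterprise data warehouse.
pd.DataFrame({
    "region":    ["North", "North", "South"],
    "product":   ["TV", "TV", "Phone"],
    "amount":    [1200.0, 800.0, None],
    "sale_date": ["2022-01-05", "2022-02-11", "2022-01-20"],
}).to_sql("sales", warehouse, index=False)

# Extract: pull only the subject area relevant to this data mart.
df = pd.read_sql_query("SELECT region, product, amount, sale_date FROM sales", warehouse)

# Transform: conform the data to the mart's requirements.
df["sale_date"] = pd.to_datetime(df["sale_date"])
df["amount"] = df["amount"].fillna(0.0)
summary = df.groupby(["region", "product"], as_index=False)["amount"].sum()

# Load: write the transformed subset into the data mart.
summary.to_sql("regional_sales", mart, if_exists="replace", index=False)
print(pd.read_sql_query("SELECT * FROM regional_sales", mart))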