DWM Module 2
Data Preprocessing: Data Preprocessing Concepts, Data Cleaning, Handling Missing Data, Data
Transformation and Discretization, Data Visualization. UCI Data Sets and Their Significance
KTU QUESTIONS
1 a) Which are the methods to handle missing values during data mining? (5)
b) What is Data Cleaning? How can we use binning to handle noisy data? (5)
c) What is data visualization? (2)
d) Why is data visualization important in data mining? List the software used for data
visualization. (3)
2 b) Explain the data transformation and discretization methods. (5)
1 b) Briefly explain the common types of data transformation techniques with suitable
examples.(5)
2 b) Discuss the issues to be considered during data cleaning. Explain how to handle noisy data
in data cleaning process. (6)
3 a) Explain the relevance of data preprocessing in data mining. Explain the methods to handle
missing values in a data set before the mining process. (6)
1 a) What are the major challenges of mining a huge amount of data in comparison with mining
a small amount of data? (5)
2 b) Use the two methods below to normalize the following group of data: 200, 300, 400, 600,
1000
i) min-max normalization by setting min=0 and max=1
ii) z-score normalization (7)
DATA PREPROCESSING
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, lacking in certain
behaviors or trends, and is likely to contain many errors.
Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares
raw data for further processing.
Data preprocessing is used in database-driven applications such as customer relationship
management and in rule-based applications, as well as in machine learning models such as neural networks.
In Machine Learning (ML) processes, data preprocessing is critical to encode the dataset in a
form that can be interpreted and parsed by the algorithm.
Measures for data quality/important factors for data quality:
1. accuracy
2. completeness
3. consistency
4. timeliness
5. believability
6. interpretability
Data preprocessing tasks
1. Data cleaning -> to remove noise/errors and correct inconsistencies in the data (missing
attribute values, discrepancies in codes or names)
2. Data integration -> where multiple data sources may be combined
3. Data reduction -> where a reduced representation of the data is obtained that is much smaller in volume yet produces the same (or almost the same) analytical results
4. Data transformation & data discretization-> where data are transformed and consolidated into
forms appropriate for mining by performing summary or aggregation operations
<1> Data Cleaning ---> Smoothing noisy data is particularly important for ML datasets, since
machines cannot make use of data they cannot interpret. Data can be cleaned by dividing it into
equal-sized segments that are then smoothed (binning), by fitting it to a linear or multiple
regression function (regression), or by grouping it into clusters of similar data (clustering); see the binning sketch after this list.
<2> Data Integration --> Integration of multiple databases, data cubes, or files. Data with
different representations are put together and conflicts within the data are resolved.
<3> Data Reduction --> Data is normalized and generalized. Normalization here is a process that
ensures that no data is redundant, that it is all stored in a single place, and that all the dependencies are
logical. The data reduction step aims to present a reduced representation of the data in a data
warehouse. There are various methods to reduce data. Encoding (compression) mechanisms can be used to
reduce the size of data as well. If all the original data can be recovered after compression, the
operation is called lossless; if some data is lost, it is called a lossy reduction. Data
can also be discretized to replace raw values with interval labels. This step reduces the
number of values of a continuous attribute by dividing the range of the attribute into
intervals.
<4> Data transformation & data discretization --> Data discretization transforms numeric
data by mapping values to interval or concept labels.
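As a concrete illustration of smoothing noisy data by binning (a minimal sketch; the sample values and the number of bins are assumed for the example, not taken from these notes):

# Sketch: equal-frequency (equal-depth) binning, then smoothing by bin means.
# The price values and the bin count are assumed for illustration.
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])   # already sorted
n_bins = 3

bins = np.array_split(prices, n_bins)                    # each bin gets 3 values
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]

Each value is replaced by the mean of its bin, which smooths out small fluctuations while keeping the overall level of the data.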
DATA CLEANING
Real-world data often contains incorrect and faulty values, along with irrelevant and missing
parts. Data cleaning is done to handle these problems.
Types of dirty data:
1. Missing/incomplete data ->
data which lacks attributes or attribute values, or contains only aggregate data. This situation arises when
some values are not recorded. It can be handled in various ways (see the sketch after this list).
for eg: salary = " " (missing data)
2. Noisy data ->
Noisy data is meaningless data that cannot be interpreted by machines. It is generated due to
faulty data collection, data entry errors, etc. It contains noise in the form of errors/outliers.
eg: salary = "-10"
3. Inconsistent data ->
data containing discrepancies between different data items. Some attributes representing a given concept
may have different names in different databases, causing inconsistencies and redundancies.
Naming inconsistencies may also occur for attribute values.
For eg. Age = "5 years", Birthday = "06/06/1990", Current Year = "2017" (the stated age contradicts the birthday and current year)
4. Intentional ->
Sometimes applications automatically assign a default value to an attribute, e.g., some applications set the gender
value to "male" by default: gender = "male". Such values are disguised missing data.
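A minimal sketch of handling missing values before mining (the DataFrame and column names are assumed for illustration; typical strategies are ignoring the tuple, filling with a global constant, or filling with the attribute mean):

# Sketch: common ways to handle missing values with pandas (data assumed).
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                   "salary": [30000, 45000, np.nan, 52000]})

dropped = df.dropna()                                  # ignore tuples with missing values
constant = df.fillna(0)                                # fill with a global constant
mean_filled = df.fillna(df.mean(numeric_only=True))    # fill with the attribute mean
print(mean_filled)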
Data integration:
Combines data from multiple heterogeneous data sources into a coherent data store and provides
a unified view of the data. These sources may include multiple data cubes, databases or flat files.
It is necessary to detect and resolve data value conflicts, e.g., when attribute values for the same
real-world entity differ across sources. The possible reasons could be different
representations, different scales, etc., e.g., metric vs. British units. Data integration causes
several issues such as data redundancy, inconsistency, duplication and many more.
Data redundancy:
Redundant data occurs when multiple databases are integrated together.
>Object identification: when multiple databases are integrated, the same object may
have different names in different databases
>Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual
revenue
Redundant data can be detected by correlation analysis and covariance analysis (see the sketch below).
Careful integration of the data from multiple sources may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality.
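As a small sketch of detecting a derivable (redundant) attribute with correlation analysis (the attribute names and values are assumptions for illustration):

# Sketch: correlation analysis to flag a redundant, derived attribute.
import numpy as np

monthly_revenue = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
annual_revenue = monthly_revenue * 12        # derived from monthly_revenue

r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
print(r)   # 1.0 -> perfectly correlated, so one of the attributes is redundant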
DATA TRANSFORMATION
A function that maps the entire set of values of a given attribute to a new set of replacement
values such that each old value can be identified with one of the new values.
Methods:
1. Smoothing: Remove noise from data. Techniques include binning, regression, and
clustering. Smoothing is a process used to remove noise from the dataset using such algorithms. It
allows important features present in the dataset to be highlighted and helps in predicting
patterns. When collecting data, it can be manipulated to eliminate or reduce variance or any
other form of noise.
The concept behind data smoothing is that it identifies simple changes that help
predict different trends and patterns. This helps analysts or traders who need to
look at a lot of data, which can often be difficult to digest, to find patterns they would not
see otherwise.
2. Normalization: where attribute values are scaled to fall within a smaller, specified range
a) Min-max normalization: maps a value v of an attribute A to v' in the new range
[new_min_A, new_max_A], where min_A and max_A are the minimum and maximum values of A:
v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
Here v is the old value and v' is the new value obtained after normalization.
b) Z-score normalization: the values of A are normalized based on the mean and standard deviation of A:
v' = (v − mean_A) / std_A
c) Normalization by decimal scaling: moves the decimal point of the values of A; the number of
decimal points moved depends on the maximum absolute value of A.
Eg : Suppose that the recorded values of A range from −986 to 917. The maximum absolute
value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1000 (i.e.,
j = 3) so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
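As a worked sketch of the normalization question above on the data 200, 300, 400, 600, 1000 (using the population standard deviation for the z-score, which is an assumption; some texts use the sample standard deviation):

# Sketch: min-max and z-score normalization of 200, 300, 400, 600, 1000.
import numpy as np

data = np.array([200, 300, 400, 600, 1000], dtype=float)

# i) min-max normalization with new_min = 0, new_max = 1
min_max = (data - data.min()) / (data.max() - data.min())
print(min_max)    # [0.    0.125 0.25  0.5   1.   ]

# ii) z-score normalization: mean = 500, population std ≈ 282.84
z_score = (data - data.mean()) / data.std()
print(z_score)    # approx [-1.06 -0.71 -0.35  0.35  1.77]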
DATA DISCRETIZATION
It can be used to reduce the number of values for a given continuous attribute by dividing the
range of the attribute into intervals; these interval labels can then be used to replace actual data
values.
Discretization techniques can be categorized based on how the discretization is performed.
1. Based on the direction in which it proceeds, we can classify it into a top-down approach and a bottom-up
approach
● If the process starts by first finding one or a few points (called split points or cut points) to
split the entire attribute range, and then repeats this recursively on the resulting intervals,
it is called top-down discretization or splitting.
● Bottom-up discretization or merging starts by considering all of the continuous
values as potential split points, removes some by merging neighboring values to form
intervals, and then recursively applies this process to the resulting intervals.
2. If the discretization process uses class information, it is called supervised; otherwise it is unsupervised.
Methods :
1. Binning is a top-down splitting technique based on a specified number of bins
● Eg : Attribute values can be discretized by applying equal-width or equal-frequency
binning, and then replacing each bin value by the bin mean or median. These
techniques can be applied recursively to the resulting partitions to generate concept
hierarchies (see the sketch after this list of methods).
2. Histogram Analysis
● A histogram partitions the values of an attribute, A, into disjoint ranges called buckets or
bins.
3. Discretization by Cluster, Decision Tree, and Correlation Analysis
● A clustering algorithm can be applied to discretize a numeric attribute, A, by partitioning
the values of A into clusters or groups.
● Clustering takes the distribution of A into consideration, as well as the closeness of data
points, and therefore is able to produce high-quality discretization results.
● Techniques to generate decision trees for Classification can be applied to discretization.
Such techniques employ a top-down splitting approach.
● Measures of correlation can be used for discretization. ChiMerge is a χ2-based
discretization method; it employs a bottom-up approach, finding the best neighboring
intervals and then merging them recursively to form larger intervals.
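A minimal sketch contrasting equal-width and equal-frequency discretization (the age values and the number of intervals are assumed for illustration):

# Sketch: equal-width vs. equal-frequency discretization with pandas (values assumed).
import pandas as pd

ages = pd.Series([22, 25, 27, 30, 35, 41, 48, 52, 60, 75])

equal_width = pd.cut(ages, bins=3)    # intervals of equal width over the value range
equal_freq = pd.qcut(ages, q=3)       # intervals holding roughly equal numbers of values
print(equal_width.value_counts())
print(equal_freq.value_counts())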
DATA VISUALISATION
Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way to
see and understand trends, outliers, and patterns in data.
In Big Data, data visualization tools and technologies are essential to analyze massive amounts
of information and make data-driven decisions.
The advantage of data visualization is its ability to attract the attention of readers. Data
visualization helps to understand and analyse data better, making it easier to identify
trends and outliers and to internalize the data efficiently. With the increasing amount of data
generated every day, data visualisation is an important tool for curating data by
highlighting important trends and outliers while removing noise from the data.
An effective data visualisation is a combination of form and function, making sure the graph
isn't too loud or too bland.
To craft an effective data visualization, you need to start with clean data that is well-sourced and
complete. After the data is ready to visualize, you need to pick the right chart.
After you have decided the chart type, you need to design and customize your visualization to
your liking. Simplicity is essential - you don't want to add any elements that distract from the
data.
Why Use Data Visualization?
1. To make data easier to understand and remember.
2. To discover unknown facts, outliers, and trends.
3. To visualize relationships and patterns quickly.
4. To ask better questions and make better decisions.
5. To support competitive analysis.
6. To improve insights.
Common general types of data visualization:
● Charts
● Tables
● Graphs
● Maps
● Infographics
● Dashboards
The best data visualization tools include Google Charts, Tableau, Grafana, Chartist.js,
FusionCharts, Datawrapper, Infogram, ChartBlocks, and D3.js. The best tools offer a variety of
visualization styles, are easy to use, and can handle large data sets.
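A minimal sketch of a simple chart used to spot an outlier visually (the data values and the choice of matplotlib are assumptions; any of the tools above could produce an equivalent plot):

# Sketch: a scatter plot that makes an outlier stand out (values assumed).
import matplotlib.pyplot as plt

salaries = [30, 32, 35, 31, 33, 120, 34, 36]   # 120 is an obvious outlier
plt.scatter(range(len(salaries)), salaries)
plt.xlabel("employee index")
plt.ylabel("salary (thousands)")
plt.title("Spotting an outlier visually")
plt.show()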