
DWM MODULE 2

Data Preprocessing: Data Preprocessing Concepts, Data Cleaning, Handling Missing Data, Data
Transformation and Discretization, Data Visualization. UCI Data Sets and Their Significance

KTU QUESTIONS
1 a) Which are the methods to handle missing values during data mining? (5)
b) What is Data Cleaning? How can we use binning to handle noisy data? (5)
c) What is data visualization? (2)
d) Why is data visualization important in data mining? List the software used for data
visualization. (3)
2 b) Explain the data transformation and discretization methods. (5)
1 b) Briefly explain the common types of data transformation techniques with suitable
examples.(5)
2 b) Discuss the issues to be considered during data cleaning. Explain how to handle noisy data
in data cleaning process. (6)
3 a) Explain the relevance of data preprocessing in data mining. Explain the methods to handle
missing values in a data set before the mining process. (6)
1 a) What are the major challenges of mining a huge amount of data in comparison with mining
a small amount of data? (5)
2 b) Use the two methods below to normalize the following group of data: 200, 300, 400, 600,
1000
i) min-max normalization by setting min=0 and max=1
ii) z-score normalization (7)

DATA PREPROCESSING
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, lacking in certain
behaviors or trends, and is likely to contain many errors.
Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares
raw data for further processing.
Data preprocessing is used in database-driven applications such as customer relationship
management and in rule-based applications.
In Machine Learning (ML) processes, data preprocessing is critical for encoding the dataset in a
form that can be interpreted and parsed by the algorithm (for example, a neural network).
Measures of data quality (important factors for data quality):
1. accuracy
2. completeness
3. consistency
4. timeliness
5. believability
6. interpretability
Data preprocessing tasks
1. Data cleaning -> to remove noise/errors and inconsistent data, such as missing attribute
values or discrepancies in codes and names
2. Data integration -> where multiple data sources may be combined
3. Data reduction -> where a reduced representation of the data set is obtained that is much
smaller in volume yet produces (almost) the same analytical results
4. Data transformation & data discretization -> where data are transformed and consolidated into
forms appropriate for mining by performing summary or aggregation operations

<1> Data Cleaning ---> Smoothing noisy data is particularly important for ML datasets, since
machines cannot make use of data they cannot interpret. Data can be cleaned by dividing it into
equal-sized segments that are then smoothed (binning), by fitting it to a linear or multiple
regression function (regression), or by grouping it into clusters of similar data (clustering).

<2> Data Integration --> Integration of multiple databases, data cubes, or files. Data with
different representations are put together and conflicts within the data are resolved.

<3> Data Reduction --> Data is normalized and generalized. Normalization is a process that
ensures that no data is redundant, that it is all stored in a single place, and that all the
dependencies are logical. The data reduction step aims to present a reduced representation of
the data in a data warehouse. There are various methods to reduce data. Encoding mechanisms
can be used to reduce the size of data as well. If all original data can be recovered after
compression, the operation is labelled lossless. If some data is lost, it is called a lossy
reduction. Data can also be discretized to replace raw values with interval labels. This step
reduces the number of values of a continuous attribute by dividing the attribute's range into
intervals.

<4> Data transformation & data discretization --> Data discretization transforms numeric
data by mapping values to interval or concept labels.
DATA CLEANING
Data in the real world may contain large amounts of incorrect and faulty values, and it can have
many irrelevant and missing parts. Data cleaning is done to handle these problems.
Types of problematic data:
1. Missing data/incomplete->
Data which lacks attributes or attribute values, or which contains only aggregate data. This
situation arises when some values are not recorded. It can be handled in various ways.
For example: salary = " " (missing value)
2. noisy->
Noisy data is meaningless data that cannot be interpreted by machines. It is generated due to
faulty data collection, data entry errors, etc. It contains noise such as errors and outliers.
For example: salary = "-10"
3. inconsistent->
Data with discrepancies between different data items. Some attributes representing a given
concept may have different names in different databases, causing inconsistencies and
redundancies. Naming inconsistencies may also occur for attribute values.
For eg. Age = "5 years", Birthday = "06/06/1990", Current Year = "2017"
4. intentional->
Sometimes applications allot a default value to an attribute; e.g., some applications set the
gender value to "male" by default (gender = "male"). Such values are disguised missing data.

HANDLING MISSING DATA


>This arises when some data is not available or missing
>Equipment malfunctions can result in missing data
>The data may have been inconsistent with other recorded data and hence deleted
It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple
values are missing within a tuple.
PROS: Complete removal of data with missing values results in a robust and highly
accurate model. Deleting a particular row or column with no specific information is
acceptable, since it does not carry a high weightage.
CONS: Loss of information and data. Works poorly if the percentage of missing values is
high (say 30%) compared to the whole dataset.
2. Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values
manually, by the attribute mean, or by the most probable value. The values can also be filled
automatically by one of the following (a minimal sketch of option (a) follows this list):
a. Using the attribute mean
PROS:
● Replacing the missing value with the mean is a better approach when the dataset is small
● It can prevent the data loss that results from removing rows and columns
CONS:
● Imputing approximations adds variance and bias
● Works poorly compared to multiple-imputation methods
b. Using a constant value, if an appropriate constant exists
c. Using the most probable value, determined by a Bayesian formula or a decision tree
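
As an illustration of option (a), here is a minimal sketch of attribute-mean imputation using
pandas; the column name "salary" and the sample values are assumptions for illustration, not
from the source.

    import pandas as pd

    # Tiny example table with one missing salary value (hypothetical data)
    df = pd.DataFrame({"salary": [30000, 45000, None, 52000]})

    # Option (a): fill the missing entry with the attribute (column) mean
    df["salary"] = df["salary"].fillna(df["salary"].mean())

    print(df)   # the missing value is replaced by (30000 + 45000 + 52000) / 3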

Handle noisy data


>Random error in a measured value results in noisy data
>This is commonly due to faulty data collection, data transmission problems, technology
limitations or inconsistencies in naming conventions
It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into
segments of equal size, and each segment is handled separately. All values in a segment can
be replaced by the segment mean or median, or the segment boundary values can be used for
smoothing. The method sorts the data and partitions it into equal-frequency bins.
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple
independent variables).
3. Clustering:
This approach groups similar data into clusters. Values that fall outside the clusters can be
treated as outliers and removed.
4. Combined computer and human inspection
Summary of methods:
1. binning-sort data and partition to equal frequency bins
-smooth by bin mean/bin median/bin boundaries
2. regression-smooth by fitting data to regression functions
3. clustering-detect and remove outliers
4. detect suspicious values and check by human
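
A minimal sketch of the binning method summarized above, using equal-frequency bins and
smoothing by bin means; the sample values and the bin size are assumptions for illustration.

    # Equal-frequency binning with smoothing by bin means (illustrative sketch)
    data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])   # hypothetical sorted values
    bin_size = 3                                         # three values per bin

    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_values = data[i:i + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([round(bin_mean, 1)] * len(bin_values))

    print(smoothed)   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]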

Importance and benefits of data cleaning


1. Data cleaning removes major errors.
2. Data cleaning ensures happier customers, more sales, and more accurate decisions.
3. Data cleaning removes inconsistencies that are likely to occur when multiple sources of
data are stored in one dataset.
4. Data cleaning makes the dataset more efficient, more reliable and more accurate.

Data cleaning tools


There are many data cleaning tools. Ten widely used data cleaning tools are listed below.
1. OpenRefine
2. Trifacta Wrangler
3. Drake
4. Data Ladder
5. Data Cleaner
6. Cloudingo
7. Reifier
8. IBM InfoSphere QualityStage
9. TIBCO Clarity
10. Winpure

Data integration:
Combines data from multiple heterogeneous data sources into a coherent data store and provides
a unified view of the data. These sources may include multiple data cubes, databases or flat files.
It is necessary to detect and resolve data value conflicts, for example when attribute values for
the same real-world entity differ between sources. Possible reasons include different
representations, different scales, etc., e.g., metric vs. British units. Data integration causes
several issues such as data redundancy, inconsistency, duplication and many more.
Data redundancy:
Redundant data occurs when multiple databases are integrated together.
>Object identification: occurs when multiple databases are integrated and the same object may
have different names in different databases
>Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual
revenue
Redundant data can be detected by correlation analysis and covariance analysis (see the
sketch below).
Careful integration of the data from multiple sources may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality.
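
A minimal sketch of redundancy detection through correlation analysis using NumPy; the
attribute names and values are assumptions, chosen so that one attribute is derivable from the
other.

    import numpy as np

    # Two numeric attributes from different sources (hypothetical values)
    monthly_revenue = np.array([10, 12, 15, 20, 22], dtype=float)
    annual_revenue = np.array([120, 144, 180, 240, 264], dtype=float)  # 12 x monthly

    # Pearson correlation coefficient; values close to +1 or -1 suggest redundancy
    r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
    print(round(r, 3))   # 1.0 here, so one of the attributes is redundant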

DATA TRANSFORMATION
Data transformation is a function that maps the entire set of values of a given attribute to a
new set of replacement values such that each old value can be identified with one of the new values.
Methods:

1. Smoothing: Remove noise from data. Techniques include binning, regression, and
clustering. Smoothing is a process used to remove noise from the dataset using certain
algorithms. It allows important features present in the dataset to be highlighted and helps in
predicting patterns. When collecting data, it can be manipulated to eliminate or reduce variance
or any other form of noise.
The concept behind data smoothing is that it is able to identify simple changes and help
predict different trends and patterns. This helps analysts or traders who need to look at a lot
of data, which can often be difficult to digest, to find patterns they would not see otherwise.

2. Attribute/feature construction : New attributes constructed from the given ones


Where new attributes are created & applied to assist the mining process from the given set of
attributes. This simplifies the original data & makes the mining more efficient.

3. Aggregation: Summarization, data cube construction.


Data collection or aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources to integrate these data sources
into a data analysis description. This is a crucial step since the accuracy of data analysis
insights is highly dependent on the quantity and quality of the data used. Gathering accurate
data of high quality and a large enough quantity is necessary to produce relevant results.
The collection of data is useful for everything from decisions concerning financing or business
strategy of the product, pricing, operations, and marketing strategies.

4. Normalization: where attributes are scaled to fall within a smaller, specified range (a worked
sketch of the three methods below follows the list of transformation methods)
a) min-max normalization: to [new_min_A, new_max_A]
min_A is the minimum and max_A is the maximum value of an attribute A. A value v of A is
normalized to the new value v' in the range [new_min_A, new_max_A] by computing

v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

Eg. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,000 is mapped to
(73,000 - 12,000) / (98,000 - 12,000) = 0.709

b) z-score normalization: (μ: mean, σ: standard deviation)

In z-score normalization (or zero-mean normalization) the values of an attribute A are
normalized based on the mean and standard deviation of A. A value v of attribute A is
normalized to v' by computing

v' = (v - μ) / σ

Eg. Let μ = 54,000 and σ = 16,000. Then 73,000 is mapped to
(73,000 - 54,000) / 16,000 = 1.188

c) normalization by decimal scaling:


It normalizes the values of an attribute by moving the position of their decimal points. The
number of places the decimal point is moved is determined by the maximum absolute value of
attribute A. A value v of attribute A is normalized to v' by computing

v' = v / 10^j

where j is the smallest integer such that Max(|v'|) < 1

Eg : Suppose the values of an attribute P vary from -99 to 99.
The maximum absolute value of P is 99.
To normalize the values we divide each number by 100 (i.e., j = 2, the number of digits in the
largest absolute value), so that values such as 98 and 97 come out as 0.98, 0.97 and so on.

Eg : Suppose that the recorded values of A range from -986 to 917. The maximum absolute
value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1000 (i.e.,
j = 3) so that -986 normalizes to -0.986 and 917 normalizes to 0.917.

5. Discretization: Concept hierarchy climbing
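
A minimal sketch applying min-max normalization (to [0, 1]), z-score normalization and decimal
scaling to the data set from the KTU question above (200, 300, 400, 600, 1000); using the
population standard deviation for the z-score is an assumption.

    import statistics

    data = [200, 300, 400, 600, 1000]

    # Min-max normalization to [new_min, new_max] = [0, 1]
    lo, hi = min(data), max(data)
    min_max = [(v - lo) / (hi - lo) for v in data]

    # Z-score normalization: v' = (v - mean) / standard deviation
    mu = statistics.mean(data)        # 500
    sigma = statistics.pstdev(data)   # population std dev (assumption), about 282.84
    z_scores = [round((v - mu) / sigma, 3) for v in data]

    # Decimal scaling: v' = v / 10^j, with j the smallest integer so that max(|v'|) < 1
    j = len(str(max(abs(v) for v in data)))   # 4 digits in 1000, so j = 4
    decimal_scaled = [v / 10 ** j for v in data]

    print(min_max)          # [0.0, 0.125, 0.25, 0.5, 1.0]
    print(z_scores)         # [-1.061, -0.707, -0.354, 0.354, 1.768]
    print(decimal_scaled)   # [0.02, 0.03, 0.04, 0.06, 0.1]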

DATA DISCRETIZATION
Discretization can be used to reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals; these interval labels can then be used to
replace actual data values.
Discretization techniques can be categorized based on how the discretization is performed.
1. Based on the direction in which it proceeds, we can classify it into a top-down approach and a
bottom-up approach.
● If the process starts by first finding one or a few points (called split points or cut points) to
split the entire attribute range, and then repeats this recursively on the resulting intervals,
it is called top-down discretization or splitting.
● Bottom-up discretization or merging starts by considering all of the continuous values as
potential split points, removes some by merging neighbouring values to form intervals, and
then recursively applies this process to the resulting intervals.
2. If the discretization process uses class information, it is called supervised; otherwise it is
unsupervised.

Three types of attributes:


● Nominal—values from an unordered set, e.g., color, profession
● Ordinal—values from an ordered set, e.g., military or academic rank
● Numeric—quantitative values, e.g., integer or real numbers

Concept Hierarchy Generation


• Concept hierarchies are used to reduce data by collecting and replacing low-level concepts
(such as numeric values for age) with higher-level concepts (such as youth, adult, or senior).
• Although details are lost by such data generalization, the generalized data may be more
meaningful and easier to interpret.

Methods :
1. Binning is a top-down splitting technique based on a specified number of bins
● Eg : Attribute values can be discretized by applying equal-width or equal-frequency
binning, and then replacing each bin value by the bin mean or median. These
techniques can be applied recursively to the resulting partitions to generate concept
hierarchies.
2. Histogram Analysis
● A histogram partitions the values of an attribute, A, into disjoint ranges called buckets or
bins.
3. Discretization by Cluster, Decision Tree, and Correlation Analysis
● A clustering algorithm can be applied to discretize a numeric attribute, A, by partitioning
the values of A into clusters or groups.
● Clustering takes the distribution of A into consideration, as well as the closeness of data
points, and therefore is able to produce high-quality discretization results.
● Techniques to generate decision trees for Classification can be applied to discretization.
Such techniques employ a top-down splitting approach.
● Measures of correlation can be used for discretization. ChiMerge is a χ2-based
discretization method that employs a bottom-up approach: it finds the best neighbouring
intervals and then merges them recursively to form larger intervals.
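
A minimal sketch of unsupervised discretization by binning using pandas (equal-width bins via
pandas.cut, equal-frequency bins via pandas.qcut), together with concept-hierarchy style labels
for age; the age values and the cut points for youth/adult/senior are assumptions for
illustration.

    import pandas as pd

    # Hypothetical age values to be discretized
    ages = pd.Series([13, 19, 23, 30, 36, 42, 47, 55, 62, 70])

    # Equal-width discretization into 3 intervals (top-down splitting)
    equal_width = pd.cut(ages, bins=3)

    # Equal-frequency discretization into 3 intervals
    equal_freq = pd.qcut(ages, q=3)

    # Concept-hierarchy style labels with assumed cut points
    concepts = pd.cut(ages, bins=[0, 25, 60, 120], labels=["youth", "adult", "senior"])

    print(pd.DataFrame({"age": ages, "equal_width": equal_width,
                        "equal_freq": equal_freq, "concept": concepts}))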

DATA VISUALISATION
Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way to
see and understand trends, outliers, and patterns in data.
In Big Data, data visualization tools and technologies are essential to analyze massive amounts
of information and make data-driven decisions.
The advantage of data visualization is its ability to attract the attention of readers. Data
visualization helps us understand and analyse data better, identify trends and outliers, and
internalize the information efficiently. With the increasing amount of data being generated every
day, data visualisation is an important tool for curating the data by highlighting important
trends and outliers while at the same time removing noise from the data. An effective data
visualisation is a combination of form and function, making sure the graph is neither too loud
nor too bland.
To craft an effective data visualization, you need to start with clean data that is well-sourced and
complete. After the data is ready to visualize, you need to pick the right chart.
After you have decided the chart type, you need to design and customize your visualization to
your liking. Simplicity is essential - you don't want to add any elements that distract from the
data.
Why Use Data Visualization?
1. To make data easier to understand and remember.
2. To discover unknown facts, outliers, and trends.
3. To visualize relationships and patterns quickly.
4. To ask better questions and make better decisions.
5. To perform competitive analysis.
6. To improve insights.
Common general types of data visualization:
● Charts
● Tables
● Graphs
● Maps
● Infographics
● Dashboards

The best data visualization tools include Google Charts, Tableau, Grafana, Chartist.js,
FusionCharts, Datawrapper, Infogram, ChartBlocks, and D3.js. The best tools offer a variety of
visualization styles, are easy to use, and can handle large data sets.
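
A minimal sketch of a simple chart using matplotlib, one of many possible visualization
libraries; the monthly sales figures are hypothetical.

    import matplotlib.pyplot as plt

    # Hypothetical monthly sales figures
    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
    sales = [120, 135, 150, 145, 170, 190]

    plt.plot(months, sales, marker="o")   # line chart to show the trend
    plt.title("Monthly Sales (hypothetical data)")
    plt.xlabel("Month")
    plt.ylabel("Sales")
    plt.show()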
