
Data Pre-processing

Prof. Ravi Patel


IT Department
ADIT
Why preprocessing?
• Today’s real-world databases are highly
susceptible to noisy, missing, and inconsistent
data due to their typically huge size (often several
gigabytes or more) and their likely origin from
multiple, heterogeneous sources.
• Data have quality if they satisfy the requirements
of the intended use. There are many factors
comprising data quality, including accuracy,
completeness, consistency, timeliness,
believability, and interpretability.
• Low-quality data will lead to low-quality mining
results.
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate
data
• e.g., occupation=“ ”
– noisy: containing errors or outliers
• e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or
names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
Data is dirty because…
• Incomplete data may come from
– “Not applicable” data value when collected
– Different considerations between the time when the data was collected
and when it is analyzed.
– Human/hardware/software problems
• Noisy data (incorrect values) may come from
– Faulty data collection instruments
– Human or computer error at data entry
– Errors in data transmission
• Inconsistent data may come from
– Different data sources
– Functional dependency violation (e.g., modify some linked data)
• Duplicate records also need data cleaning
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially for
numerical data
Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
– Data warehouse needs consistent integration of quality
data
• Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
Data Cleaning
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Data Cleaning process for missing
values
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– not register history or changes of the data
• Missing data may need to be inferred.
Methods:
1> Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
2> Fill in the missing value manually: In general, this approach is time
consuming and may not be feasible given a large data set with many missing
values.
3> Fill it in automatically with a global constant: e.g., “unknown” (note that the constant may effectively form a new class).
4> Use a measure of central tendency for the attribute (e.g., the mean or
median) to fill in the missing value
5> Use the attribute mean or median for all samples belonging to the same
class as the given tuple
6> Use the most probable value to fill in the missing value
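A minimal pandas sketch of methods 3–5 (the DataFrame, column names, and values below are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical sales data with missing customer income
df = pd.DataFrame({
    "income": [45000, None, 52000, None, 61000],
    "class":  ["low", "low", "high", "high", "high"],
})

# Method 3: fill with a global constant / sentinel value
df["income_const"] = df["income"].fillna(-1)

# Method 4: fill with a measure of central tendency (here the mean)
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Method 5: fill with the mean of all samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)
```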
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal
with possible outliers)
Binning
• Binning: Binning methods smooth a sorted
data value by consulting its “neighborhood,”
that is, the values around it. The sorted values
are distributed into a number of “buckets,” or
bins. Because binning methods consult the
neighborhood of values, they perform local
smoothing.
• smooth by bin means, smooth by bin median,
smooth by bin boundaries
Approach of binning
• Equal-width (distance) partitioning
– Divides the range into N intervals of equal size: uniform grid
– if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B –A)/N.
– The most straightforward, but outliers may dominate presentation
– Skewed data is not handled well
• Equal-depth (frequency) partitioning
– Divides the range into N intervals, each containing approximately same
number of samples
– Good data scaling
– Managing categorical attributes can be a problem
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Example
• Data : 10,2,19,18,20,18,25,28,22
• Bin size : 3
• By bin mean, by bin median, and by bin boundaries
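A minimal Python sketch of the three smoothing variants applied to the data above, assuming equal-frequency bins of size 3 (ties between boundaries go to the lower boundary):

```python
data = sorted([10, 2, 19, 18, 20, 18, 25, 28, 22])   # [2, 10, 18, 18, 19, 20, 22, 25, 28]
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

# Smooth by bin means: replace every value with the mean of its bin
by_mean = [[sum(b) / len(b)] * len(b) for b in bins]

# Smooth by bin medians: replace every value with the middle value of its bin
by_median = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]

# Smooth by bin boundaries: replace every value with the closest bin boundary
by_boundaries = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print("bins:         ", bins)           # [[2, 10, 18], [18, 19, 20], [22, 25, 28]]
print("by mean:      ", by_mean)        # [[10.0, 10.0, 10.0], [19.0, 19.0, 19.0], [25.0, 25.0, 25.0]]
print("by median:    ", by_median)      # [[10, 10, 10], [19, 19, 19], [25, 25, 25]]
print("by boundaries:", by_boundaries)  # [[2, 2, 18], [18, 18, 20], [22, 22, 28]]
```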
Regression
• Regression is a data mining function that predicts a
number. Age, weight, distance, temperature, income, or
sales could all be predicted using regression techniques. For
example, a regression model could be used to predict
children's height, given their age, weight, and other factors.
• A regression task begins with a data set in which the target
values are known. For example, a regression model that
predicts children's height could be developed based on
observed data for many children over a period of time. The
data might track age, height, weight, developmental
milestones, family history, and so on. Height would be the
target, the other attributes would be the predictors, and
the data for each child would constitute a case.
linear regression
• The simplest form of regression to visualize is
linear regression with a single predictor.
• A linear regression technique can be used if
the relationship between x and y can be
approximated with a straight line.
In a linear regression scenario with a single predictor (y = θ2·x + θ1), the regression parameters (also called coefficients) are:
• The slope of the line (θ2) — how much y changes for a unit change in x, i.e., the steepness of the regression line, and
• The y-intercept (θ1) — the value of y at the point where the line crosses the y axis (x = 0).
• In the model build (training) process, a
regression algorithm estimates the value of
the target as a function of the predictors for
each case in the build data. These
relationships between predictors and target
are summarized in a model, which can then be
applied to a different data set in which the
target values are unknown.
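A small sketch of this build-then-apply process for a single predictor, using NumPy's least-squares fit (the age/height numbers are hypothetical):

```python
import numpy as np

# Hypothetical training data: x = age (years), y = height (cm), target values known
x = np.array([4, 6, 8, 10, 12, 14])
y = np.array([102, 115, 128, 139, 149, 160])

# Fit y = theta2 * x + theta1 by least squares
theta2, theta1 = np.polyfit(x, y, deg=1)   # slope, then intercept
print(f"slope = {theta2:.2f}, intercept = {theta1:.2f}")

# Apply the model to new cases where the target is unknown
new_ages = np.array([5, 9, 13])
print(theta2 * new_ages + theta1)
```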
Application
• trend analysis
• business planning
• marketing
• financial forecasting
• time series prediction
• biomedical and drug response modeling
• environmental modeling.
What is cluster
• 1. A cluster is a subset of objects which are
“similar”
• 2. A subset of objects such that the distance
between any two objects in the cluster is less
than the distance between any object in the
cluster and any object not located inside it.
• 3. A connected region of a multidimensional
space containing a relatively high density of
objects
• Clustering is a process of partitioning a set of data (or
objects) into a set of meaningful sub-classes, called
clusters.
• Help users understand the natural grouping or
structure in a data set.
• Clustering: unsupervised classification: no predefined
classes.
• Used either as a stand-alone tool to get insight into
data distribution or as a preprocessing step for other
algorithms.
• Moreover, data compression, outliers detection,
understand human concept formation.
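One simple way to use clustering for outlier detection is to treat values that fall into very small clusters as suspicious. A minimal sketch with scikit-learn's KMeans (assuming scikit-learn is available; the salary values and the "small cluster" rule are hypothetical choices for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D salary data (in $1000s) with two suspicious values
salaries = np.array([30, 32, 35, 31, 33, 34, 29, 250, 260]).reshape(-1, 1)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(salaries)
labels, counts = np.unique(kmeans.labels_, return_counts=True)

# Values falling into very small clusters are flagged as possible outliers
small_clusters = labels[counts <= 2]
outliers = salaries.ravel()[np.isin(kmeans.labels_, small_clusters)]
print(outliers)   # expected: [250 260]
```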
Application Area
• Economic Science (especially market research)
• WWW: document classification; clustering weblog data to discover groups of similar access patterns
• Pattern Recognition
• Spatial Data Analysis: create thematic maps in GIS by clustering feature spaces
• Image Processing
Data Integration
• Motivation
Many databases and sources of data that need to be
integrated to work together
Almost all applications have many sources of data
• Data Integration
Is the process of integrating data from multiple sources
and probably have a single view over all these sources
And answering queries using the combined information
• Integration can be physical or virtual
• Physical: Copying the data to a warehouse
• Virtual: Keep the data only at the sources
HETEROGENEITY PROBLEMS
• Source Type Heterogeneity
Systems storing the data can be different
• Communication Heterogeneity
Some systems have web interface others do not
Some systems allow direct query language others
offer APIs
• Schema Heterogeneity
The structure of the tables storing the data can be
different (even if storing the same data)
• Data Type Heterogeneity
Storing the same data (and values) but with different data
types
E.g., Storing the phone number as String or as Number
• Value Heterogeneity
Same logical values stored in different ways E.g., ‘Prof’,
‘Prof.’, ‘Professor’
• Semantic Heterogeneity
Same values in different sources can mean different things
E.g., Column ‘Title’ in one database means ‘Job Title’ while in another database it means ‘Person Title’
Handling Redundancy in Data
Integration
Redundant data occur often when integrating multiple databases:
– Object identification: The same attribute or object
may have different names in different databases
– Derivable data: One attribute may be a “derived”
attribute in another table
• Redundant attributes may be able to be detected by
correlation analysis
-Careful integration of the data from multiple sources
may help reduce/avoid redundancies and inconsistencies
and improve mining speed and quality
Correlation Analysis (Numerical Data)
• Correlation coefficient (also called Pearson’s product-moment coefficient):
rA,B = Σ(ai − Ā)(bi − B̄) / (n σA σB) = (Σ(ai bi) − n Ā B̄) / (n σA σB)
• where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(ai bi) is the sum of the AB cross-product.
• If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
• rA,B = 0: A and B are uncorrelated (no linear relationship);
• rA,B < 0: A and B are negatively correlated
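A minimal NumPy sketch of the coefficient, using two short hypothetical attributes and the population standard deviation that the formula assumes:

```python
import numpy as np

# Hypothetical attributes A and B measured over the same n tuples
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.5, 3.9, 6.2, 7.8, 10.1])

n = len(A)
r = ((A - A.mean()) * (B - B.mean())).sum() / (n * A.std() * B.std())
print(round(r, 4))                                 # close to +1: strong positive correlation
print(round(float(np.corrcoef(A, B)[0, 1]), 4))    # same value from NumPy's built-in
```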
Correlation Analysis (Categorical Data)
• χ² (chi-square) test
• The larger the χ² value, the more likely the variables are related.
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
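The χ² statistic sums (observed − expected)² / expected over the cells of a contingency table. A small sketch with SciPy (assuming SciPy is available; the observed counts are hypothetical):

```python
from scipy.stats import chi2_contingency

# Hypothetical observed counts:  col_1  col_2
observed = [[250,   200],       # row_1
            [ 50,  1000]]       # row_2

# correction=False gives the plain chi-square statistic (no Yates correction)
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2), round(p_value, 6), dof)
print(expected)   # counts expected under independence; compare with the observed cells
```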
Data Transformation
• In data transformation, the data are transformed or consolidated into forms appropriate for mining.
• Data transformation can involve the following tasks:
– Smoothing
– Normalization
– Attribute construction
– Aggregation
– Discretization
– Generalization(Concept hierarchy)
• Normalization
– the attribute data are scaled so as to fall within a small
specified range, such as -1.0 to 1.0, 0.0 to 1.0
• Attribute construction (or feature construction)
– new attributes are constructed and added from the given
set of attributes to help the mining process.
• Aggregation
– summary or aggregation operations are applied to the data.
– For example, the daily sales data may be aggregated so as
to compute monthly and annual total amounts.
• Discretization
– Dividing the range of a continuous attribute into intervals
– For example, values for numerical attributes, like age, may
be mapped to higher-level concepts, like youth, middle aged,
and senior.
• Generalization
– where low-level or “primitive” (raw) data are replaced by
higher-level concepts through the use of concept hierarchies.
– For example, categorical attributes, like street, can be
generalized to higher-level concepts, like city or country.
Normalization
• An attribute is normalized by scaling its values
so that they fall within a small specified
range, such as 0.0 to 1.0.
• Normalization is particularly useful for
classification algorithms involving
– neural networks
– distance measurements such as nearest-
neighbor classification and clustering.
• Normalization methods
– Min-max normalization
– z-score normalization
– Normalization by decimal scaling
Min-max normalization
• Min-max normalization : performs a linear
transformation on the original data.
– minA and maxA are the minimum and maximum values of an
attribute, A.
Min-max normalization maps a value, v, of A to v' in the range [new_minA, new_maxA] by computing:
v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
• Let income range from $12,000 to $98,000, normalized to [0.0, 1.0].
• Then $73,000 is mapped to ((73,000 − 12,000) / (98,000 − 12,000)) × (1.0 − 0.0) + 0.0 ≈ 0.709.
Exercise: Apply min-max normalization to marks: 8, 10, 15, 20
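A minimal sketch of the exercise, mapping the marks into [0.0, 1.0]:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

marks = [8, 10, 15, 20]
print([round(v, 3) for v in min_max(marks)])   # [0.0, 0.167, 0.583, 1.0]
```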
z-score normalization
• In z-score (zero-mean) normalization, a value v of attribute A is normalized based on the mean and standard deviation of A: v' = (v − Ā) / σA.
• Note that normalization can change the original data quite a bit, especially the z-score method.
Exercise: Apply z-score normalization to marks: 8, 10, 15, 20
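A minimal sketch of the exercise using v' = (v − Ā) / σA with the population standard deviation:

```python
import statistics

def z_score(values):
    mean = statistics.mean(values)
    std = statistics.pstdev(values)   # population standard deviation
    return [(v - mean) / std for v in values]

marks = [8, 10, 15, 20]
print([round(v, 3) for v in z_score(marks)])   # approx. [-1.127, -0.698, 0.376, 1.449]
```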
Decimal scaling
• Normalization by decimal scaling moves the decimal point of the values of attribute A: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
• Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1000 (i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
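A minimal sketch of decimal scaling for the values above plus a couple of extra hypothetical ones:

```python
import math

def decimal_scale(values):
    # smallest j such that max(|v|) / 10^j < 1
    max_abs = max(abs(v) for v in values)
    j = math.floor(math.log10(max_abs)) + 1
    return [v / 10 ** j for v in values], j

scaled, j = decimal_scale([-986, 917, 230, -45])
print(j)        # 3
print(scaled)   # [-0.986, 0.917, 0.23, -0.045]
```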
Attribute construction
(feature construction)
– new attributes are constructed from the given
attributes and added in order to help improve the
accuracy and understanding of structure in high-
dimensional data.
- For example, given the two features height and weight, it might be advantageous to construct a feature body mass index (BMI), which is expressed as weight ÷ height².
- Attribute construction can help discover missing information about relationships between attributes.
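A minimal pandas sketch of this BMI construction (hypothetical heights in metres and weights in kilograms):

```python
import pandas as pd

# Hypothetical records with the two given features
people = pd.DataFrame({"height": [1.60, 1.75, 1.82], "weight": [55, 78, 95]})

# Construct the new attribute from the given ones: BMI = weight / height^2
people["bmi"] = people["weight"] / people["height"] ** 2
print(people)
```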
Data aggregation
• Data aggregation is a type of data and information
mining process where data is searched, gathered and
presented in a report-based, summarized format to
achieve specific business objectives or processes
and/or conduct human analysis.
• Data aggregation is any process in which information is
gathered and expressed in a summary form, for
purposes such as statistical analysis. A common
aggregation purpose is to get more information about
particular groups based on specific variables such as
age, profession, or income.
• Data aggregation may be performed manually or
through specialized software.
• Data aggregation is a component of business intelligence (BI) solutions. Data aggregation personnel or software search databases, find relevant data, and present the findings in a summarized format that is meaningful and useful for the end user or application.
• Data cubes store multidimensional aggregated information.
• Data cubes provide fast access to precomputed, summarized data
thereby benefiting on-line analytical processing as well as
datamining.
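A minimal pandas sketch of the daily-to-monthly aggregation mentioned earlier (the sales figures and branch names are hypothetical):

```python
import pandas as pd

# Hypothetical daily sales data
daily = pd.DataFrame({
    "date":   pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-28"]),
    "branch": ["A", "A", "A", "B"],
    "sales":  [120.0, 80.0, 95.0, 60.0],
})

# Aggregate the daily figures into monthly totals per branch
monthly = daily.groupby([daily["date"].dt.to_period("M"), "branch"])["sales"].sum()
print(monthly)
```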
Data Reduction
• Why data reduction?
A database/data warehouse may store terabytes of data, so complex data analysis/mining may take a very long time to run when processing the complete data set.
• Data reduction : Obtain a reduced
representation of the data set that is much
smaller in volume but yet produce the same
(or almost the same) analytical results
How?
• Reducing number of attributes
• Reducing number of attribute values
• Reducing number of tuples
Data Reduction
• Data reduction
• a) Data cube aggregation
• b) Attribute subset selection
• c) Dimensional reduction .
• d) Data Sampling.
• e) Numerosity reduction
• f) Discretization and concept hierarchy
generation
attribute subset selection
• Why attribute subset selection
• Data sets for analysis may contain hundreds of
attributes, many of which may be irrelevant to
the mining task or redundant.
For example,
• if the task is to classify customers as to whether
or not they are likely to purchase a popular new
CD at AllElectronics when notified of a sale,
attributes such as the customer’s telephone
number are likely to be irrelevant, unlike
attributes such as age or music_taste.
• Using a domain expert to pick out some of the useful attributes can be a difficult and time-consuming task, especially when the behavior of the data is not well known.
• Leaving out relevant attributes or keeping irrelevant attributes may result in discovered patterns of poor quality.
• In addition, the added volume of irrelevant or
redundant attributes can slow down the mining
process.
Attribute subset selection (feature selection):
• Reduce the data set size by removing irrelevant or redundant attributes
Goal:
• select a minimum set of features (attributes) such
that the probability distribution of different
classes given the values for those features is as
close as possible to the original distribution given
the values of all features
• It reduces the number of attributes appearing in
the discovered patterns, helping to make the
patterns easier to understand.
How can we find a ‘good’ subset of the
original attributes?
For n attributes, there are 2^n possible subsets.
• An exhaustive search for the optimal subset of
attributes can be prohibitively expensive, especially as
n increase.
• Heuristic methods that explore a reduced search space
are commonly used for attribute subset selection.
• These methods are typically greedy in that, while
searching through attribute space, they always make
what looks to be the best choice at the time.
• Such greedy methods are effective in practice and may
come close to estimating an optimal solution.
Heuristic methods
• 1 Step-wise forward selection
• 2 Step-wise backward elimination
• 3 Combining forward selection and backward
elimination
• 4 Decision-tree induction
The “best” (and “worst”) attributes are typically determined using:
– tests of statistical significance, which assume that the attributes are independent of one another
– the information gain measure used in building decision trees for classification
Stepwise forward selection
• – The procedure starts with an empty set of attributes as the
reduced set.
• – First: The best single-feature is picked.
• – Next: At each subsequent iteration or step, the best of the
remaining original attributes is added to the set.
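A minimal sketch of step-wise forward selection, using cross-validated accuracy of a decision tree as the “best attribute” criterion (one possible choice; the slides mention statistical significance tests or information gain). It assumes scikit-learn and its bundled iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))   # candidate attribute indices
selected = []                         # the reduced set, initially empty

def score(cols):
    # cross-validated accuracy using only the chosen columns
    clf = DecisionTreeClassifier(random_state=0)
    return cross_val_score(clf, X[:, cols], y, cv=5).mean()

best_score = 0.0
while remaining:
    # pick the best of the remaining original attributes
    candidate = max(remaining, key=lambda c: score(selected + [c]))
    candidate_score = score(selected + [candidate])
    if candidate_score <= best_score:   # stop when no attribute improves the score
        break
    selected.append(candidate)
    remaining.remove(candidate)
    best_score = candidate_score

print("selected attribute indices:", selected)
```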
Stepwise backward elimination:
• The procedure starts with the full set of attributes.
• At each step, it removes the worst attribute remaining in the set.
Combining forward selection and
backward elimination:
• – The stepwise forward selection and
backward elimination methods can be
combined
• – At each step, the procedure selects the best
attribute and removes the worst from among
the remaining attributes.
Decision tree induction:
– Decision tree algorithms, such as ID3, C4.5, and CART, were
originally intended for classification.
• Decision tree induction constructs a flowchart-like
structure where each internal (nonleaf) node denotes a
test on an attribute, each branch corresponds to an
outcome of the test and each external (leaf) node denotes
a class prediction.
• At each node, the algorithm chooses the “best” attribute to
partition the data into individual classes.
• When decision tree induction is used for attribute subset
selection, a tree is constructed from the given data.
• All attributes that do not appear in the tree are assumed to
be irrelevant.
• Decision tree induction
Data Compression
• String compression: There are extensive theories
and well‐tuned algorithms
– Typically lossless
– But only limited manipulation is possible without
expansion
• Audio/video compression Typically lossy
compression, with progressive refinement
– Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
• Time sequences are not audio: typically short and vary slowly with time
Discretization
Three types of attributes:
• Nominal —values from an unordered set, e.g., color, profession
• Ordinal —values from an ordered set, e.g., military or academic
rank
• Continuous — numeric values, e.g., integer or real numbers

Discretization:
• Divide the range of a continuous attribute into intervals
• Some classification algorithms only accept categorical attributes.
• Reduce data size by discretization
• Prepare for further analysis
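A minimal pandas sketch of discretization: equal-width intervals, equal-frequency intervals, and mapping age to higher-level concepts (the age values and interval edges are hypothetical):

```python
import pandas as pd

ages = pd.Series([13, 22, 25, 33, 41, 47, 52, 61, 70])

# Equal-width intervals
equal_width = pd.cut(ages, bins=3)

# Equal-frequency intervals
equal_depth = pd.qcut(ages, q=3)

# Map the continuous attribute to higher-level concepts
concepts = pd.cut(ages, bins=[0, 29, 59, 120], labels=["youth", "middle_aged", "senior"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width,
                    "equal_depth": equal_depth, "concept": concepts}))
```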
Discretization
Typical methods:
1 Binning
2 Clustering analysis
3 Interval merging by χ² analysis
Architecture of a typical data mining system
Components (top to bottom in the figure):
• Graphical user interface
• Pattern evaluation
• Knowledge base
• Data mining engine
• Database or data warehouse server
• Data cleansing, data integration, and filtering
• Database / data warehouse
Data Mining Task Primitives
• Misconception: Data mining systems can
autonomously dig out all of the valuable
knowledge from a given large database, without
human intervention.

• If there were no user intervention, the system would uncover a large set of patterns that may even surpass the size of the database. Hence, user interaction is required.
• This user communication with the system is provided by using a set of data mining primitives.
Data Mining Task Primitives
• Each user will have a data mining task in mind, that is,
some form of data analysis that he or she would like to
have performed.
• A data mining task can be specified in the form of a
data mining query, which is input to the data mining
system. A data mining query is defined in terms of data
mining task primitives
• These primitives allow the user to interactively
communicate with the data mining system during
discovery in order to direct the mining process, or
examine the findings from different angles or depths.
Data Mining Task Primitives
Data mining primitives:
• Task-relevant data
• Kind of knowledge to be mined
• Background knowledge
• Interestingness measures
• Presentation and visualization of discovered patterns
Task Relevant Data
• This specifies the portions of the database or the set of data in which the user is interested.
• This includes:
o Database or data warehouse name
o Database tables or data warehouse cubes
o Conditions for data selection
o Relevant attributes or dimensions
o Data grouping criteria
Example:
• If a data mining task is to study associations between items
frequently purchased at AllElectronics by customers in Canada, the
task relevant data can be specified by providing the following
information:
– Name of the database or data warehouse to be used (e.g.,
AllElectronics_db)
– Names of the tables or data cubes containing relevant data (e.g.,
item, customer, purchases and items_sold)
– Conditions for selecting the relevant data (e.g., retrieve data
pertaining to purchases made in Canada for the current year)
– The relevant attributes or dimensions (e.g., name and price
from the item table and income and age from the customer table)
Kind of knowledge to be mined
• It is important to specify the knowledge to be mined, as this
determines the data mining function to be performed.
• Users can also provide pattern templates, also called metapatterns, metarules, or metaqueries.
• The data mining functions to be performed:
o Characterization
o Discrimination
o Association
o Classification/prediction
o Clustering
o Outlier analysis
o Other data mining tasks
Background knowledge
• This knowledge about the domain to be mined is
useful for guiding the knowledge discovery
process and for evaluating the patterns found
• Concept hierarchy: is a powerful form of
background knowledge.
o Four major types of concept hierarchies:
schema hierarchies
set-grouping hierarchies
operation-derived hierarchies
rule-based hierarchies
Schema hierarchies
• Schema hierarchy is the total or partial
order among attributes in the database
schema.
• May formally express existing semantic
relationships between attributes.
• Example: location hierarchy
street < city < province/state < country
Set-grouping hierarchies
• Organizes values for a given attribute into groups, sets, or ranges of values.
• A total or partial order can be defined among groups.
• Used to refine schema-defined hierarchies.
• Typically used for small sets of object relationships.
• Example: Set-grouping hierarchy for age
{young, middle_aged, senior} ⊂ all(age)
{20…29} ⊂ young
{40…59} ⊂ middle_aged
{60…89} ⊂ senior
Operation-derived hierarchies
• Operation-derived:
based on operations specified
operations may include
decoding of information-encoded strings
information extraction from complex data
objects
data clustering
Example: URL or email address
[email protected] gives login name < dept. < univ. < country
Rule-based hierarchies
• Rule-based:
Occurs when either whole or portion of a concept hierarchy is
defined as a set of rules and is evaluated dynamically based on
current database data and rule definition

• Example: The following rules are used to categorize items as low_profit_margin, medium_profit_margin, and high_profit_margin:
low_profit_margin(X) <= price(X,P1) ^ cost(X,P2) ^ ((P1-P2) < 50)
medium_profit_margin(X) <= price(X,P1) ^ cost(X,P2) ^ ((P1-P2) ≥ 50) ^ ((P1-P2) ≤ 250)
high_profit_margin(X) <= price(X,P1) ^ cost(X,P2) ^ ((P1-P2) > 250)
Interestingness measure
• Used to confine the number of uninteresting patterns returned by the process.
• Based on the structure of patterns and the statistics underlying them.
• Patterns not meeting the threshold are not presented to the user.
• Objective measures of pattern interestingness:
simplicity
certainty (confidence)
utility (support)
novelty
Simplicity
• A pattern's interestingness is based on its overall simplicity for human comprehension.
• Example: Rule length is a simplicity measure; e.g., rule 1> A, B, C => D is longer (less simple) than rule 2> A => B, C or A => B.
Support
• Utility (support): usefulness of a pattern.
support(A => B) = (# tuples containing both A and B) / (total # of tuples)
• A support of 30% for the rule means that 30% of all customers purchased both a computer and software.
• Every association rule has a support and a confidence.
• “The support is the percentage of transactions that demonstrate the rule.”
• Example: Database with transactions (customer_# : item_a1, item_a2, …)
1: 1, 3, 5.
2: 1, 8, 14, 17, 12.
3: 4, 6, 8, 12, 9, 104.
4: 2, 1, 8.
support {8, 12} = 2 (or 50% ~ 2 of 4 customers)
support {1, 5} = 1 (or 25% ~ 1 of 4 customers)
support {1} = 3 (or 75% ~ 3 of 4 customers)
• An itemset is called frequent if its support is equal to or greater than an agreed-upon minimal value – the support threshold.
Confidence
• Certainty (confidence): assesses the validity or trustworthiness of a pattern; confidence is a certainty measure.
• Confidence is an indication of how often the rule has been found to be true.
confidence(A => B) = (# tuples containing both A and B) / (# tuples containing A)
• A confidence of 85% for the rule buys(X, “computer”) => buys(X, “software”) means that 85% of all customers who purchased a computer also bought software.
• Confidence(X => Y) = support(X, Y) / support(X)
• Association rules that satisfy both the
minimum confidence and support threshold
are referred to as strong association rules
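A minimal sketch computing support and confidence for the four example transactions shown in the Support slide above:

```python
# The four transactions from the support example (item sets per customer)
transactions = [
    {1, 3, 5},
    {1, 8, 14, 17, 12},
    {4, 6, 8, 12, 9, 104},
    {2, 1, 8},
]

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # confidence(lhs => rhs) = support(lhs ∪ rhs) / support(lhs)
    return support(lhs | rhs) / support(lhs)

print(support({8, 12}))                  # 0.5  (2 of 4 transactions)
print(support({1}))                      # 0.75 (3 of 4 transactions)
print(round(confidence({8}, {12}), 3))   # 0.5 / 0.75 ≈ 0.667
```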
Novelty
• Novelty
Patterns contributing new information to
the given pattern set are called novel
patterns
removing redundant patterns is a strategy
for detecting novelty.
Presentation and visualization
• For data mining to be effective, data mining
systems should be able to display the
discovered patterns in multiple forms, such
as rules, tables, crosstabs (cross-tabulations),
pie or bar charts, decision trees, cubes, or
other visual representations.

• Users must be able to specify the forms of presentation to be used for displaying the discovered patterns.
