Data Preprocessing

Why Preprocess the Data?


Today’s real-world databases are highly susceptible to
noisy, missing, and inconsistent data due to their typically
huge size and their likely origin from multiple,
heterogenous sources.

Low-quality data will lead to low-quality mining results.


Major Tasks in Data Preprocessing

● Data Cleaning
● Data Integration
● Data Reduction
● Data Transformation
Major Tasks in Data Preprocessing
Data Cleaning
Data Cleaning
Data cleaning routines
attempt to fill in missing
values, smooth out noise
while identifying outliers,
and correct inconsistencies
in the data.
Methods for Filling in Missing Values
Ignore the tuple
This method is not very effective, unless the tuple contains
several attributes with missing values. It is especially
poor when the percentage of missing values per attribute
varies considerably.

By ignoring the tuple, we do not make use of the remaining attributes’ values in the tuple. Such data could have been useful to the task at hand.
Fill in the missing value manually
In this approach, you find each missing value and manually enter it into the database to fix the tuple.

In general, this approach is time consuming and may not be feasible given a large data set with many missing values.
Use a global constant to fill in the missing value
Replace all missing attribute values by the same constant
such as a label like “Unknown” or −∞.

If missing values are replaced by, say, “Unknown,” then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common. Hence, although this method is simple, it is not foolproof.
Use a measure of central tendency for the attribute
In this method, we find the central tendency of all the data for the attribute and replace the missing value with it.

For normal (symmetric) data distributions, the mean can be used, while skewed data distributions should employ the median.
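A minimal sketch of this strategy, assuming pandas is available and using a hypothetical numeric attribute income:

```python
import pandas as pd

# Hypothetical attribute with missing values; income is typically skewed,
# so the median is the safer choice here.
df = pd.DataFrame({"income": [30_000, 45_000, None, 52_000, 1_000_000, None]})

mean_filled = df["income"].fillna(df["income"].mean())      # symmetric data
median_filled = df["income"].fillna(df["income"].median())  # skewed data
print(median_filled.tolist())
```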
Use the attribute mean or median for all samples
belonging to the same class as the given tuple
In this method, we once again use a measure of central tendency to obtain a value that we can substitute for the missing one.

The difference is that we compute the measure of central tendency only over the tuples belonging to the same class (category) as the given tuple.
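A minimal sketch of this class-conditional variant, again assuming pandas and hypothetical credit_risk / income attributes:

```python
import pandas as pd

# Fill missing income with the median of the tuple's own class
# (credit_risk) rather than the global median.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "low", "high", "high", "high"],
    "income": [50_000, None, 55_000, 20_000, 22_000, None],
})

df["income"] = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.median())
)
print(df)
```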
Use the most probable value to fill in the missing value
In this method, we predict each missing value using an algorithm and fill it in accordingly.

This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.
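A minimal sketch of predictive filling, here using decision tree induction from scikit-learn on a hypothetical age/income table (regression or Bayesian tools could be substituted):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Train on the complete tuples, then predict the missing income values from age.
df = pd.DataFrame({
    "age": [23, 35, 45, 52, 29, 61],
    "income": [28_000, 52_000, None, 75_000, None, 80_000],
})

known = df[df["income"].notna()]
unknown = df[df["income"].isna()]

model = DecisionTreeRegressor().fit(known[["age"]], known["income"])
df.loc[df["income"].isna(), "income"] = model.predict(unknown[["age"]])
print(df)
```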
Dealing with
Noisy Data
Binning
Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it.

The sorted values are distributed into a number of “buckets,” or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
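A minimal sketch of equal-frequency binning with smoothing by bin means, on illustrative price values:

```python
import numpy as np

# Sorted values partitioned into 3 equal-frequency bins of 3 values each;
# every value is then replaced by its bin mean (local smoothing).
prices = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))
bins = np.split(prices, 3)

smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)
```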
Binning
Regression
Data smoothing can also be done by regression, a technique
that conforms data values to a function.

Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other.

Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
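A minimal sketch of smoothing by simple linear regression, where a hypothetical noisy attribute y is replaced by the values predicted from x:

```python
import numpy as np

# Fit the "best" straight line through (x, y) and use its predictions
# as the smoothed version of y.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])   # noisy observations

w, b = np.polyfit(x, y, deg=1)   # slope and intercept of the fitted line
y_smoothed = w * x + b
print(y_smoothed)
```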
Outlier analysis
Outliers may be detected by
clustering, for example,
where similar values are
organized into groups, or
“clusters.” Intuitively,
values that fall outside of
the set of clusters may be
considered outliers.
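A minimal sketch of clustering-based outlier detection using k-means from scikit-learn; the data values and the distance cutoff are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster the points, then flag any point that lies far from its
# assigned cluster centroid as a candidate outlier.
values = np.array([[1.0], [1.1], [0.9], [1.05],
                   [8.0], [8.2], [7.9], [8.1], [4.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)
dist = np.linalg.norm(values - km.cluster_centers_[km.labels_], axis=1)

print(values[dist > 2.0])   # illustrative cutoff; flags the value 4.5
```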
Fixing Data
Inconsistencies
Discrepancy Detection Tools
Data scrubbing tools use simple domain knowledge (e.g.,
knowledge of postal addresses and spell-checking) to detect
errors and make corrections in the data.

Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate such conditions.
Data Transformation Tools
Data migration tools allow simple transformations to be
specified such as to replace the string “gender” by “sex”.

ETL (extraction/transformation/loading) tools allow users to specify transforms through a graphical user interface (GUI).
Data Integration
Data Integration

● Entity Identification Problem
● Redundancy and Correlation Analysis
● Tuple Duplication
● Data Value Conflict Detection and Resolution
Data Reduction
Data Reduction
Data reduction is used to obtain a reduced representation of
the data set that is much smaller in volume, yet closely
maintains the integrity of the original data.

That is, mining on the reduced data set should be more efficient yet produce the same analytical results.
Data Reduction Strategies
Dimensionality Reduction:
● Wavelet Transforms
● Principal Components Analysis
● Attribute Subset Selection

Numerosity Reduction:
● Regression
● Histograms
● Clustering
● Sampling
● Data Cube Aggregation
Data Reduction
Dimensionality reduction is the process of reducing the
number of random variables or attributes under
consideration.

Numerosity reduction techniques replace the original data volume by alternative, smaller forms of data representation. These techniques may be parametric or nonparametric.
Dimensionality
Reduction
Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X′, of wavelet coefficients.

When applying this technique to data reduction, we consider each tuple as an n-dimensional data vector, that is, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes.
Applying a Discrete Wavelet Transform
1. The length, L, of the input data vector must be an
integer power of 2. This condition can be met by padding
the data vector with zeros as necessary (L ≥ n).
2. Each transform involves applying two functions. The first
applies some data smoothing, such as a sum or weighted
average. The second performs a weighted difference, which
acts to bring out the detailed features of the data.
Applying a Discrete Wavelet Transform
3. The two functions are applied to pairs of data points in
X, that is, to all pairs of measurements (x2i,x2i+1).
This results in two data sets of length L/2.
4. The two functions are recursively applied to the data
sets obtained in the previous loop, until the resulting
data sets obtained are of length 2.
5. Selected values from the data sets obtained in the
previous iterations are designated the wavelet
coefficients of the transformed data.
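A minimal sketch of one level of this procedure using the Haar wavelet, assuming the third-party PyWavelets package (pywt) is installed:

```python
import numpy as np
import pywt  # PyWavelets, assumed installed (pip install PyWavelets)

# Pad the 7-value vector with zeros so its length is a power of 2 (L = 8),
# then split it into smoothed (approximation) and detail coefficients.
x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0])
x = np.pad(x, (0, 1))

approx, detail = pywt.dwt(x, "haar")   # weighted averages and differences
print(approx)   # length L/2 smoothed coefficients
print(detail)   # length L/2 detail coefficients
```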
Principal Components Analysis
Principal components analysis searches for k n-dimensional
orthogonal vectors that can best be used to represent the
data, where k ≤ n.
The original data are thus projected onto a much smaller
space, resulting in dimensionality reduction.
PCA often reveals relationships that were not previously suspected and thereby allows interpretations that would not ordinarily result.
Principal Components Analysis Procedure
1. The input data are normalized, so that each attribute
falls within the same range. This step helps ensure that
attributes with large domains will not dominate
attributes with smaller domains.
2. PCA computes k orthonormal vectors that provide a basis
for the normalized input data. These are unit vectors
that each point in a direction perpendicular to the
others. These vectors are referred to as the principal
components. The input data are a linear combination of
the principal components.
Principal Components Analysis Procedure
3. The principal components are sorted in
order of decreasing “significance” or
strength. The principal components
essentially serve as a new set of axes
for the data, providing important
information about variance. That is,
the sorted axes are such that the
first axis shows the most variance
among the data, the second axis shows
the next highest variance, and so on.
Principal Components Analysis Procedure
4. Because the components are sorted in decreasing order of
“significance,” the data size can be reduced by
eliminating the weaker components, that is, those with
low variance. Using the strongest principal components,
it should be possible to reconstruct a good approximation
of the original data.
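A minimal sketch of this procedure with scikit-learn on hypothetical data: normalize the attributes, keep k = 2 principal components, and inspect how much variance each captures:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                # 100 tuples, n = 5 attributes

X_norm = StandardScaler().fit_transform(X)   # step 1: normalization
pca = PCA(n_components=2).fit(X_norm)        # steps 2-4: keep k = 2 components
X_reduced = pca.transform(X_norm)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance captured by each new axis
```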
Attribute Subset Selection
Attribute subset selection reduces the data set size by
removing irrelevant or redundant attributes (or dimensions).

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
Attribute Subset Selection
For n attributes, there are 2^n possible subsets. An exhaustive search for the optimal subset of attributes can be prohibitively expensive, especially as n and the number of data classes increase.

Therefore, heuristic methods that explore a reduced search space are commonly used for attribute subset selection.
Heuristic Methods Techniques
1. Stepwise forward selection: The procedure starts with an
empty set of attributes as the reduced set. The best of
the original attributes is determined and added to the
reduced set. At each subsequent iteration or step, the
best of the remaining original attributes is added to the
set.
2. Stepwise backward elimination: The procedure starts with
the full set of attributes. At each step, it removes the
worst attribute remaining in the set.
Heuristic Methods Techniques
3. Combination of forward selection and backward
elimination: The stepwise forward selection and backward
elimination methods can be combined so that, at each
step, the procedure selects the best attribute and
removes the worst from among the remaining attributes.
Heuristic Methods Techniques
4. Decision tree induction: Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the “best” attribute to partition the data into individual classes.
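A minimal sketch of stepwise forward selection, using the cross-validated accuracy of a decision tree on the Iris data as one possible attribute "goodness" measure (scikit-learn assumed available):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))   # candidate attribute indices
selected = []
best_prev = 0.0

while remaining:
    # Score each candidate attribute when added to the current subset.
    scores = [
        (cross_val_score(DecisionTreeClassifier(random_state=0),
                         X[:, selected + [a]], y, cv=5).mean(), a)
        for a in remaining
    ]
    best_score, best_attr = max(scores)
    if best_score <= best_prev:       # stop when no attribute improves the score
        break
    selected.append(best_attr)
    remaining.remove(best_attr)
    best_prev = best_score

print(selected)   # indices of the chosen attribute subset
```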
Numerosity
Reduction
Linear Regression Models
Regression and log-linear
models can be used to
approximate the given data.
In (simple) linear
regression, the data are
modeled to fit a straight
line.
Simple Linear Regression
For example, a random variable, y (called a response variable), can be modeled as a linear function of another random variable, x (called a predictor variable), with the equation y = wx + b, where the variance of y is assumed to be constant.

The coefficients, w and b (called regression coefficients), specify the slope of the line and the y-intercept, respectively.
Multiple linear regression
Multiple linear regression is an extension of (simple) linear regression, which allows a response variable, y, to be modeled as a linear function of two or more predictor variables.
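A minimal sketch with scikit-learn on hypothetical data: once the model is fit, only the regression coefficients and intercept need to be stored as the reduced, parametric representation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # two predictor attributes
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 4.0 + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)          # the stored parameters
```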
Histograms
Histograms use binning to approximate data distributions and
are a popular form of data reduction.
Partitioning Rules
● Equal-width: In an equal-width histogram, the width of
each bucket range is uniform.
● Equal-frequency (or equal-depth): In an equal-frequency
histogram, the buckets are created so that, roughly, the
frequency of each bucket is constant.
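A minimal sketch contrasting the two partitioning rules with pandas on illustrative price values:

```python
import pandas as pd

prices = pd.Series([1, 1, 5, 5, 5, 8, 8, 10, 10, 12,
                    14, 14, 15, 18, 20, 21, 25, 28, 30])

equal_width = pd.cut(prices, bins=3)    # uniform bucket width
equal_freq = pd.qcut(prices, q=3)       # roughly equal count per bucket

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```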
Clustering
Clustering techniques consider data tuples as objects.

They partition the objects into groups, or clusters, so that objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters.
Clustering
The “quality” of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster.

Centroid distance is an alternative measure of cluster quality and is defined as the average distance of each cluster object from the cluster centroid.
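A minimal sketch of both quality measures for one illustrative 2-D cluster (numpy and scipy assumed available):

```python
import numpy as np
from scipy.spatial.distance import pdist

cluster = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [1.2, 1.8]])

diameter = pdist(cluster).max()          # maximum pairwise distance
centroid = cluster.mean(axis=0)
centroid_distance = np.linalg.norm(cluster - centroid, axis=1).mean()

print(diameter, centroid_distance)
```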
Sampling
Sampling can be used as a data reduction technique because
it allows a large data set to be represented by a much
smaller random data sample (or subset).

Suppose that a large dataset, D, contains N tuples.


Common ways to sample for data reduction
Simple random sample without replacement (SRSWOR) of size s:

This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled.
Common ways to sample for data reduction
Simple random sample with replacement (SRSWR) of size s:

This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
SRSWOR / SRSWR Example
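A minimal sketch of both schemes with pandas, on a hypothetical data set D of N = 1000 tuples and a sample size s = 100:

```python
import pandas as pd

D = pd.DataFrame({"value": range(1000)})

srswor = D.sample(n=100, replace=False, random_state=0)   # without replacement
srswr = D.sample(n=100, replace=True, random_state=0)     # with replacement

print(len(srswor), srswor.index.is_unique)   # 100, always unique tuples
print(len(srswr), srswr.index.is_unique)     # 100, duplicates are possible
```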
Common ways to sample for data reduction
Cluster sample:

If the tuples in D are grouped into M mutually disjoint “clusters,” then an SRS of s clusters can be obtained, where s < M.
Common ways to sample for data reduction
Stratified sample:

If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum. This helps ensure a representative sample, especially when the data are skewed.
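A minimal sketch of a stratified sample with pandas (1.1+), drawing an SRS of 10% within each stratum of a hypothetical, skewed customer_type attribute:

```python
import pandas as pd

D = pd.DataFrame({
    "customer_type": ["regular"] * 900 + ["premium"] * 100,
    "spend": range(1000),
})

stratified = D.groupby("customer_type").sample(frac=0.1, random_state=0)
print(stratified["customer_type"].value_counts())   # 90 regular, 10 premium
```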
Data Cube Aggregation
Data cubes store
multidimensional aggregated
information. Each cell holds
an aggregate data value,
corresponding to the data
point in multidimensional
space.
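A minimal sketch of the idea using a pandas pivot table as a small, two-dimensional "cube" over hypothetical sales data; each cell holds total sales for one year x branch combination:

```python
import pandas as pd

sales = pd.DataFrame({
    "year":   [2022, 2022, 2023, 2023, 2023],
    "branch": ["A", "B", "A", "B", "B"],
    "amount": [100, 150, 120, 180, 60],
})

cube = sales.pivot_table(values="amount", index="year",
                         columns="branch", aggfunc="sum")
print(cube)
```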
Data
Transformation
Data Transformation Strategies
● Smoothing, which works to remove noise from the data.
Techniques include binning, regression, and clustering.
● Attribute construction (or feature construction), where
new attributes are constructed and added from the given
set of attributes to help the mining process.
● Aggregation, where summary or aggregation operations are
applied to the data. This step is typically used in
constructing a data cube for data analysis at multiple
abstraction levels.
Data Transformation Strategies
● Normalization, where the attribute data are scaled so as
to fall within a smaller range, such as −1.0 to 1.0, or
0.0 to 1.0.
● Discretization, where the raw values of a numeric
attribute (e.g., age) are replaced by interval labels
(e.g., 0–10, 11–20, etc.) or conceptual labels (e.g.,
youth, adult, senior). The labels, in turn, can be
recursively organized into higher-level concepts,
resulting in a concept hierarchy for the numeric
attribute.
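A minimal sketch of the normalization and discretization strategies with pandas, on a hypothetical age attribute (the interval boundaries are illustrative):

```python
import pandas as pd

age = pd.Series([5, 16, 23, 35, 48, 67, 80])

# Min-max normalization to the range [0.0, 1.0].
normalized = (age - age.min()) / (age.max() - age.min())

# Discretization into conceptual labels via illustrative interval boundaries.
labels = pd.cut(age, bins=[0, 17, 64, 120],
                labels=["youth", "adult", "senior"])

print(normalized.round(2).tolist())
print(labels.tolist())
```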
Data Transformation Strategies
● Concept hierarchy generation for nominal data, where
attributes such as street can be generalized to
higher-level concepts, like city or country. Many
hierarchies for nominal attributes are implicit within
the database schema and can be automatically defined at
the schema definition level.
Key Points
Key Points
● Data cleaning routines attempt to fill in missing values,
smooth out noise while identifying outliers, and correct
inconsistencies in the data. Data cleaning is usually
performed as an iterative two-step process consisting of
discrepancy detection and data transformation.
Key Points
● Data integration combines data from multiple sources to form a coherent data store. The resolution of semantic heterogeneity, metadata, correlation analysis, tuple duplication detection, and data conflict detection contribute to smooth data integration.
Key Points
● Data reduction techniques obtain a reduced representation
of the data while minimizing the loss of information
content. These include methods of dimensionality
reduction, numerosity reduction, and data compression.
● Dimensionality reduction reduces the number of random
variables or attributes under consideration.
● Numerosity reduction methods use parametric or nonparametric models to obtain smaller representations of the original data.
Key Points
● Data transformation routines convert the data into
appropriate forms for mining. For example, in
normalization, attribute data are scaled so as to fall
within a small range such as 0.0 to 1.0. Other examples
are data discretization and concept hierarchy generation.
The End
