Class-Data Preprocessing-II
Yashvardhan Sharma
CS F415
Transaction Data
• A special type of record data, where
• each record (transaction) involves a set of items.
• For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitutes a
transaction, while the individual products that were purchased are
the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
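As a minimal sketch (plain Python; the representation is illustrative, not from the slides), the table above can be stored as item sets keyed by TID:

# Each transaction is a set of items, keyed by transaction ID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Example query: how many transactions contain both Beer and Diaper?
count = sum(1 for items in transactions.values() if {"Beer", "Diaper"} <= items)
print(count)  # 2 (transactions 3 and 4)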
Data Matrix
• If data objects have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a multi-
dimensional space, where each dimension represents a
distinct attribute
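A minimal sketch (assuming NumPy; the height/weight attributes are illustrative) of such a data matrix, one row per object and one column per attribute:

import numpy as np

# Rows = data objects, columns = numeric attributes (e.g., height in m, weight in kg).
X = np.array([
    [1.7, 65.0],
    [1.8, 80.0],
    [1.6, 55.0],
])
print(X.shape)  # (3, 2): three points in a 2-dimensional attribute space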
Document-Term Matrix
• Each document becomes a ‘term’ vector,
• each term is a component (attribute) of the vector,
• the value of each component is the number of times the
corresponding term occurs in the document.
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
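A minimal sketch (plain Python; the two example documents are illustrative) of building such a matrix by counting term occurrences:

from collections import Counter

docs = [
    "team play ball score game game lost season",
    "coach coach ball score lost",
]
vocab = sorted({term for doc in docs for term in doc.split()})

# Each row is a term-frequency vector over the shared vocabulary.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[t] for t in vocab])
print(vocab)
print(matrix)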
Graph Data
• Examples: Generic graph and HTML Links
(Figure: a generic weighted graph.)

<a href="papers/papers.html#bbbb">Data Mining</a>
<li><a href="papers/papers.html#aaaa">Graph Partitioning</a>
<li><a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations</a>
<li><a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers</a>
Chemical Data
• Benzene Molecule: C6H6
Ordered Data
• Sequences of transactions
(Figure: a sequence of transactions; each element of the sequence is a set of items/events.)
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatio-Temporal Data
(Figure: average monthly temperature of land and ocean.)
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove
outliers (outliers = exceptions!), and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains a reduced representation that is much smaller in volume but
produces the same or similar analytical results
• Data discretization
• Part of data reduction but with particular importance, especially for
numerical data
Forms of data preprocessing
(Figure: the four forms — data cleaning, data integration, data transformation, and data reduction.)
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
Data Cleaning
• Importance
• “Data cleaning is one of the three biggest problems in data warehousing”—
Ralph Kimball
• “Data cleaning is the number one problem in data warehousing”—DCI survey
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
• equipment malfunction
• data that was inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data not considered important at the time of entry
• history or changes of the data not being registered
• Missing data may need to be inferred.
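A minimal sketch (assuming pandas; the income column is illustrative) of two common remedies — dropping incomplete tuples or filling with the attribute mean:

import numpy as np
import pandas as pd

# Sales data with missing customer incomes.
df = pd.DataFrame({"income": [50_000, np.nan, 72_000, np.nan, 61_000]})

dropped = df.dropna()                                # remove tuples with missing values
filled = df.fillna({"income": df["income"].mean()})  # infer: fill with the mean
print(filled)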
Missing Values
• Reasons for missing values
• Information is not collected
(e.g., people decline to give their age and weight)
• Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitations
• inconsistency in naming conventions
• Other data problems that require data cleaning
• duplicate records
• incomplete data
• inconsistent data
Noise
• Noise refers to modification of original values
• Example: distortion of a person’s voice when talking on a poor phone connection
Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B –A)/N.
• The most straightforward approach, but outliers may dominate the presentation
• Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
• Divides the range into N intervals, each containing approximately the
same number of samples
• Good data scaling
• Managing categorical attributes can be tricky.
Binning Methods for Data Smoothing
• Sorted data (e.g., by price)
• 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
• Bin 1: 4, 8, 9, 15
• Bin 2: 21, 21, 24, 25
• Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
• Bin 1: 9, 9, 9, 9
• Bin 2: 23, 23, 23, 23
• Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
• Bin 1: 4, 4, 4, 15
• Bin 2: 21, 21, 25, 25
• Bin 3: 26, 26, 26, 34
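A minimal sketch (plain Python) reproducing the equi-depth partitioning and smoothing-by-means steps above:

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # sorted data

# Equal-width reference: W = (B - A) / N = (34 - 4) / 3 = 10
N = 3
W = (max(prices) - min(prices)) / N

# Equi-depth partitioning: N bins with the same number of values.
size = len(prices) // N
bins = [prices[i:i + size] for i in range(0, len(prices), size)]

# Smoothing by bin means: replace each value by its bin's mean.
smoothed = [round(sum(b) / len(b)) for b in bins for _ in b]
print(smoothed)  # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]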
Cluster Analysis
(Figure: data points grouped into clusters; values falling outside all clusters can be treated as outliers.)
Regression
(Figure: noisy data smoothed by fitting a regression line, e.g., y = x + 1; a point (X1, Y1) is replaced by its fitted value Y1'.)
Outliers
• Outliers are data objects with characteristics that are
considerably different from those of most other data objects in
the data set
Duplicate Data
• Data set may include data objects that are duplicates, or
almost duplicates of one another
• Major issue when merging data from heterogeneous sources
• Examples:
• Same person with multiple email addresses
• Data cleaning
• Process of dealing with duplicate data issues
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Data Reduction Strategies
• A data warehouse may store terabytes of data
• Complex data analysis/mining may take a very long time to run on the complete
data set
• Data reduction
• Obtain a reduced representation of the data set that is much smaller in volume
yet produces the same (or almost the same) analytical results
• Data reduction strategies
• Data cube aggregation
• Dimensionality reduction — remove unimportant attributes
• Data Compression
• Discretization and concept hierarchy generation
Aggregation
• Combining two or more attributes (or objects) into a single
attribute (or object)
• Purpose
• Data reduction
• Reduce the number of attributes or objects
• Change of scale
• Cities aggregated into regions, states, countries, etc.
• More “stable” data
• Aggregated data tends to have less variability
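A minimal sketch (assuming pandas; the city/region data is illustrative) of aggregation as a change of scale — city sales rolled up to regions:

import pandas as pd

sales = pd.DataFrame({
    "city":   ["Pilani", "Jaipur", "Mumbai", "Pune"],
    "region": ["North", "North", "West", "West"],
    "sales":  [120, 180, 300, 260],
})

# Aggregating cities into regions: fewer objects, more "stable" totals.
by_region = sales.groupby("region")["sales"].sum()
print(by_region)  # North 300, West 560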
Data Cube Aggregation
• The lowest level of a data cube
• the aggregated data for an individual entity of interest
• e.g., a customer in a phone calling data warehouse.
• Multiple levels of aggregation in data cubes
• Further reduce the size of data to deal with
• Reference appropriate levels
• Use the smallest representation which is enough to solve the task
• Queries regarding aggregated information should be answered using the
data cube, when possible
Sample Cube
(Figure: a sales data cube with dimensions Date (1Qtr–4Qtr), product (TV, PC, VCR), and Country (U.S.A., Canada, Mexico); the sum margins hold precomputed aggregates such as total annual sales of TV in U.S.A. and total Q1 sales in all countries.)
Sampling
• Sampling is used in data mining because processing the entire set of data
of interest is too expensive or time consuming.
Sampling …
• The key principle for effective sampling is the following:
• using a sample will work almost as well as using the entire data set, if the
sample is representative
Types of Sampling
• Simple Random Sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• As each item is selected, it is removed from the population
• Sampling with replacement
• Objects are not removed from the population as they are selected for the
sample.
• In sampling with replacement, the same object can be picked up more than once
• Stratified sampling
• Split the data into several partitions; then draw random samples from each
partition
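A minimal sketch (assuming NumPy; the population and class labels are illustrative) of the three schemes above:

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(100)                  # population of 100 objects
labels = np.repeat([0, 1], [90, 10])   # skewed classes: 90% vs 10%

without_repl = rng.choice(data, size=10, replace=False)  # item removed once selected
with_repl = rng.choice(data, size=10, replace=True)      # same object may repeat

# Stratified: draw from each class partition in proportion to its size.
strata = [rng.choice(data[labels == c],
                     size=max(1, round(0.1 * (labels == c).sum())),
                     replace=False)
          for c in np.unique(labels)]
print(without_repl, with_repl, np.concatenate(strata))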
Sampling
• Allow a mining algorithm to run with complexity potentially sublinear
in the size of the data
• Choose a representative subset of the data
• Simple random sampling may have very poor performance in the presence of
skew
• Develop adaptive sampling methods
• Stratified sampling:
• Approximate the percentage of each class (or subpopulation of interest) in the
overall database
• Used in conjunction with skewed data
• Sampling may not reduce database I/Os (page at a time).
Sampling
(Figures: the raw data set compared with a cluster/stratified sample drawn from it.)
Sample Size
(Figure: samples of decreasing size drawn from the same data set; finer structure is lost as the sample shrinks.)
• What sample size is necessary to get at least one
object from each of 10 groups?
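A minimal sketch (plain Python; assumes the 10 groups are equally likely) estimating by simulation the probability that a sample of size n covers all groups:

import random

def prob_all_groups(n, groups=10, trials=10_000):
    """Estimate P(a sample of size n contains at least one object from every group)."""
    hits = 0
    for _ in range(trials):
        seen = {random.randrange(groups) for _ in range(n)}
        hits += (len(seen) == groups)
    return hits / trials

for n in (10, 20, 40, 60):
    print(n, prob_all_groups(n))  # coverage probability rises with n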
Data Dimensionality
• From a theoretical point of view, increasing the number of
features should lead to better performance; in practice, however,
adding features beyond a point often hurts performance (the curse
of dimensionality).
Dimensionality Reduction
• Purpose:
• Avoid curse of dimensionality
• Reduce amount of time and memory required by data mining algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise
• Techniques
• Principal Component Analysis
• Singular Value Decomposition
• Others: supervised and non-linear techniques
Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
(Figure: a decision tree testing A4? at the root, with A1? and A6? below; attributes not used in the tree are discarded, giving the reduced attribute set {A1, A4, A6}.)
Dimensionality Reduction (cont’d)
• Idea: represent data in terms of basis vectors in a lower dimensional space
(embedded within the original space).
Principal Component Analysis
• Given N data vectors in k dimensions, find c ≤ k orthogonal
vectors that can best be used to represent the data
• The original data set is reduced to one consisting of N data vectors on c
principal components (reduced dimensions)
• Each data vector is a linear combination of the c principal component
vectors
• Works for numeric data only
• Used when the number of dimensions is large
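A minimal sketch (assuming NumPy; the function and variable names are my own) of PCA via eigendecomposition of the covariance matrix:

import numpy as np

def pca(X, c):
    """Project an N x k data matrix X onto its top-c principal components."""
    Xc = X - X.mean(axis=0)                  # center each attribute
    cov = np.cov(Xc, rowvar=False)           # k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric matrix, ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:c]]  # c highest-variance directions
    return Xc @ top                          # N x c reduced representation

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
print(pca(X, 2).shape)  # (100, 2)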
(Figure: PCA rotating the original axes X1, X2 into new axes Y1, Y2 aligned with the directions of greatest variance.)
Dimensionality Reduction: PCA
• Goal is to find a projection that captures the largest amount of
variation in data
(Figure: 2-D data with the direction of largest variation indicated.)
Dimensionality Reduction: PCA
• Find the eigenvectors of the covariance matrix
• The eigenvectors define the new space
(Figure: the two eigenvectors of the covariance matrix overlaid on the data.)
Dimensionality Reduction: PCA
(Figure: an image reconstructed from progressively fewer principal components — 206, 160, 120, 80, 40, and 10 dimensions.)
PCA: Motivation
• Choose directions such that the total variance of the data
is maximized (maximize total variance)
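In standard notation (a sketch; the symbols w and Σ are not from the slide), this is

\[
\max_{\mathbf{w}} \; \mathbf{w}^{\top} \Sigma\, \mathbf{w} \quad \text{subject to} \quad \lVert \mathbf{w} \rVert = 1,
\]

where Σ is the covariance matrix of the centered data; the maximizer is the eigenvector of Σ with the largest eigenvalue.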
Principal Component Analysis (PCA)
• Dimensionality reduction implies information loss; PCA
preserves as much information as possible by minimizing
the reconstruction error:
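A sketch of the standard objective (notation assumed, not from the slide): writing \(\hat{\mathbf{x}}_i\) for the projection of \(\mathbf{x}_i\) onto the top-c principal components, PCA minimizes

\[
E = \sum_{i=1}^{N} \lVert \mathbf{x}_i - \hat{\mathbf{x}}_i \rVert^{2}.
\]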