Class-4-Data Preprocessing

This document provides an overview of the CS F415: Data Mining course. It discusses topics that will be covered such as data preprocessing, data mining versus database management systems, data warehousing and online analytical processing. Examples are given to illustrate the differences between DBMS, OLAP and data mining approaches. Major issues in data warehousing and mining are also outlined such as performance, data types, applications and privacy. The importance of data preprocessing is emphasized to obtain quality data and ensure quality mining results.

Uploaded by

f20201207


CS F415: Data Mining

Yashvardhan Sharma

30-Jan-24 CS F415

Today’s Outline
• Introduction - Overview
• Data Preprocessing
  • Why preprocess the data?
  • Types of Data Sets
  • Data Quality
  • Steps in Data Preprocessing

Data Mining vs. DBMS
• Example DBMS Reports
  • Last month's sales for each service type
  • Sales per service grouped by customer sex or age bracket
  • List of customers who lapsed their policy
• Questions answered using Data Mining
  • What characteristics do customers who lapse their policy have in common, and how do they differ from customers who renew their policy?
  • Which motor insurance policyholders would be potential customers for my House Content Insurance policy?

Data Mining and Data Warehousing
• Data Warehouse: a centralized data repository which can be queried for business benefit.
• Data Warehousing makes it possible to
  • extract archived operational data
  • overcome inconsistencies between different legacy data formats
  • integrate data throughout an enterprise, regardless of location, format, or communication requirements
  • incorporate additional or expert information
• OLAP: On-line Analytical Processing
  • Multi-Dimensional Data Model (Data Cube)
  • Operations:
    • Roll-up
    • Drill-down
    • Slice and dice
    • Rotate

An OLAM Architecture
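[Figure: four-layer OLAM architecture.
 Layer 4: User Interface (User GUI API), which accepts a mining query and returns a mining result.
 Layer 3: OLAP/OLAM (OLAM Engine and OLAP Engine), reached through the Data Cube API.
 Layer 2: MDDB with Meta Data, reached through the Database API (Filtering & Integration).
 Layer 1: Data Repository (Databases and Data Warehouse), populated through data cleaning, data integration and filtering.]
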
DBMS, OLAP, and Data Mining

• Task
  • DBMS: Extraction of detailed and summary data
  • OLAP: Summaries, trends and forecasts
  • Data Mining: Knowledge discovery of hidden patterns and insights
• Type of result
  • DBMS: Information
  • OLAP: Analysis
  • Data Mining: Insight and Prediction
• Method
  • DBMS: Deduction (Ask the question, verify with data)
  • OLAP: Multidimensional data modeling, Aggregation, Statistics
  • Data Mining: Induction (Build the model, apply it to new data, get the result)
• Example question
  • DBMS: Who purchased mutual funds in the last 3 years?
  • OLAP: What is the average income of mutual fund buyers by region by year?
  • Data Mining: Who will buy a mutual fund in the next 6 months and why?

Example of DBMS, OLAP and Data Mining: Weather Data
DBMS:
Day  outlook   temperature  humidity  windy  play
1    sunny     85           85        false  no
2    sunny     80           90        true   no
3    overcast  83           86        false  yes
4    rainy     70           96        false  yes
5    rainy     68           80        false  yes
6    rainy     65           70        true   no
7    overcast  64           65        true   yes
8    sunny     72           95        false  no
9    sunny     69           70        false  yes
10   rainy     75           80        false  yes
11   sunny     75           70        true   yes
12   overcast  72           90        true   yes
13   overcast  81           75        false  yes
14   rainy     71           91        true   no

Example of DBMS, OLAP and Data Mining: Weather Data
• By querying a DBMS containing the above table we may answer questions like:
  • What was the temperature on the sunny days? {85, 80, 72, 69, 75}
  • On which days was the humidity less than 75? {6, 7, 9, 11}
  • On which days was the temperature greater than 70? {1, 2, 3, 8, 10, 11, 12, 13, 14}
  • On which days was the temperature greater than 70 and the humidity less than 75? The intersection of the above two: {11}

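
The four DBMS-style questions can be reproduced with plain filters over the weather table. This is only a sketch in Python rather than in a query language; the tuple layout and variable names are assumptions made for illustration.

```python
# Weather table from the slide: (day, outlook, temperature, humidity, windy, play)
WEATHER = [
    (1, "sunny", 85, 85, False, "no"),
    (2, "sunny", 80, 90, True, "no"),
    (3, "overcast", 83, 86, False, "yes"),
    (4, "rainy", 70, 96, False, "yes"),
    (5, "rainy", 68, 80, False, "yes"),
    (6, "rainy", 65, 70, True, "no"),
    (7, "overcast", 64, 65, True, "yes"),
    (8, "sunny", 72, 95, False, "no"),
    (9, "sunny", 69, 70, False, "yes"),
    (10, "rainy", 75, 80, False, "yes"),
    (11, "sunny", 75, 70, True, "yes"),
    (12, "overcast", 72, 90, True, "yes"),
    (13, "overcast", 81, 75, False, "yes"),
    (14, "rainy", 71, 91, True, "no"),
]

# Temperatures on sunny days (a selection followed by a projection)
sunny_temps = [temp for _, outlook, temp, _, _, _ in WEATHER if outlook == "sunny"]

# Days with humidity < 75, and days with temperature > 70
dry_days = {day for day, _, _, hum, _, _ in WEATHER if hum < 75}
warm_days = {day for day, _, temp, _, _, _ in WEATHER if temp > 70}

# The conjunction of both conditions is the set intersection
warm_and_dry = warm_days & dry_days

print(sunny_temps)           # [85, 80, 72, 69, 75]
print(sorted(dry_days))      # [6, 7, 9, 11]
print(sorted(warm_days))     # [1, 2, 3, 8, 10, 11, 12, 13, 14]
print(sorted(warm_and_dry))  # [11]
```

Every answer is a subset of rows or values already stored in the table, which is what distinguishes this kind of query from OLAP summaries and mined models.
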
Example of DBMS, OLAP and Data Mining: Weather Data
OLAP:
• Using OLAP we can create a Multidimensional Model of our data (Data Cube).
• For example, using the dimensions time, outlook and play we can create the following model. Each cell gives the number of days with play = yes / play = no (Week 1 is days 1-7, Week 2 is days 8-14); the top-left cell is the overall total.

9/5     sunny  rainy  overcast
Week 1  0/2    2/1    2/0
Week 2  2/1    1/1    2/0

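
The cube cells can be recomputed from the weather table by counting play = yes / play = no per (week, outlook) pair. The sketch below assumes that days 1-7 form week 1 and days 8-14 form week 2; `week_of` and `roll_up` are illustrative helper names, not part of any OLAP product.

```python
from collections import defaultdict

# (day, outlook, play) projected from the weather table
ROWS = [
    (1, "sunny", "no"), (2, "sunny", "no"), (3, "overcast", "yes"),
    (4, "rainy", "yes"), (5, "rainy", "yes"), (6, "rainy", "no"),
    (7, "overcast", "yes"), (8, "sunny", "no"), (9, "sunny", "yes"),
    (10, "rainy", "yes"), (11, "sunny", "yes"), (12, "overcast", "yes"),
    (13, "overcast", "yes"), (14, "rainy", "no"),
]

def week_of(day):
    """Assumed time hierarchy: days 1-7 are week 1, days 8-14 are week 2."""
    return (day - 1) // 7 + 1

# cube[(week, outlook)] = [count of play=yes, count of play=no]
cube = defaultdict(lambda: [0, 0])
for day, outlook, play in ROWS:
    cube[(week_of(day), outlook)][0 if play == "yes" else 1] += 1

def roll_up(cells, keep):
    """Aggregate away every dimension except the one at index `keep` (0=week, 1=outlook)."""
    out = defaultdict(lambda: [0, 0])
    for key, (yes, no) in cells.items():
        out[key[keep]][0] += yes
        out[key[keep]][1] += no
    return dict(out)

by_outlook = roll_up(cube, keep=1)
total = [sum(c[0] for c in cube.values()), sum(c[1] for c in cube.values())]

print(cube[(1, "sunny")])   # [0, 2]
print(by_outlook["rainy"])  # [3, 2]
print(total)                # [9, 5]
```

Rolling up over the time dimension collapses the two week rows into one row per outlook; the grand total 9/5 matches the corner cell of the slide's table.
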
Example of DBMS, OLAP and Data Mining: Weather Data

Data Mining:
• Using the ID3 algorithm we can produce the following decision tree:
  • outlook = sunny
    • humidity = high: no
    • humidity = normal: yes
  • outlook = overcast: yes
  • outlook = rainy
    • windy = true: no
    • windy = false: yes

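
ID3 chooses the attribute to split on by information gain. The sketch below computes the gain of `outlook` and `windy` on the 14-day weather table, which suggests why `outlook` ends up at the root. It is a partial illustration, not a full ID3 implementation, and it leaves out humidity because the slide does not say how the numeric values were turned into high/normal.

```python
from collections import Counter
from math import log2

# (outlook, windy, play) for the 14 days of the weather table
ROWS = [
    ("sunny", False, "no"), ("sunny", True, "no"), ("overcast", False, "yes"),
    ("rainy", False, "yes"), ("rainy", False, "yes"), ("rainy", True, "no"),
    ("overcast", True, "yes"), ("sunny", False, "no"), ("sunny", False, "yes"),
    ("rainy", False, "yes"), ("sunny", True, "yes"), ("overcast", True, "yes"),
    ("overcast", False, "yes"), ("rainy", True, "no"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index, label_index=2):
    """Entropy of the labels minus the weighted entropy after splitting on one attribute."""
    labels = [r[label_index] for r in rows]
    remainder = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[label_index] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

gain_outlook = information_gain(ROWS, 0)
gain_windy = information_gain(ROWS, 1)
print(round(gain_outlook, 3), round(gain_windy, 3))  # 0.247 0.048
```
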
Major Issues in Data Warehousing and Mining
• Mining methodology and user interaction
  • Mining different kinds of knowledge in databases
  • Interactive mining of knowledge at multiple levels of abstraction
  • Incorporation of background knowledge
  • Data mining query languages and ad-hoc data mining
  • Expression and visualization of data mining results
  • Handling noise and incomplete data
  • Pattern evaluation: the interestingness problem
• Performance and scalability
  • Efficiency and scalability of data mining algorithms
  • Parallel, distributed and incremental mining methods

Major Issues in Data Warehousing and Mining
• Issues relating to the diversity of data types
  • Handling relational and complex types of data
  • Mining information from heterogeneous databases and global information systems (WWW)
• Issues related to applications and social impacts
  • Application of discovered knowledge
    • Domain-specific data mining tools
    • Intelligent query answering
    • Process control and decision making
  • Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
  • Protection of data security, integrity, and privacy

Why Data Preprocessing?
• Data in the real world is dirty
  • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • noisy: containing errors or outliers
  • inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  • Quality decisions must be based on quality data
  • Data warehouse needs consistent integration of quality data
• Required for both OLAP and Data Mining!

Why can Data be Incomplete?
• Attributes of interest are not available (e.g., customer information for sales transaction data)
• Data were not considered important at the time of transactions, so they were not recorded!
• Data not recorded because of misunderstanding or malfunctions
• Data may have been recorded and later deleted!
• Missing/unknown values for some data

Why can Data be Noisy/Inconsistent?
• Faulty instruments for data collection
• Human or computer errors
• Errors in data transmission
• Technology limitations (e.g., sensor data come at a faster rate than they can be processed)
• Inconsistencies in naming conventions or data codes (e.g., 2/5/2018 could be 2 May 2018 or 5 Feb 2018)
• Duplicate tuples (e.g., records that were received twice) should also be removed

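
The date example can be made concrete: the string 2/5/2018 parses without error under both a day-first and a month-first convention, so the inconsistency is silent unless it is checked for. The helper `is_ambiguous` below is a hypothetical name used only for illustration.

```python
from datetime import datetime

raw = "2/5/2018"

# The same string parsed under two conventions
day_first = datetime.strptime(raw, "%d/%m/%Y").date()    # read as 2 May 2018
month_first = datetime.strptime(raw, "%m/%d/%Y").date()  # read as 5 Feb 2018

print(day_first.isoformat())    # 2018-05-02
print(month_first.isoformat())  # 2018-02-05

def is_ambiguous(text):
    """True when both conventions parse the string and give different dates."""
    parsed = set()
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            parsed.add(datetime.strptime(text, fmt).date())
        except ValueError:
            pass  # this convention cannot represent the string
    return len(parsed) > 1

print(is_ambiguous("2/5/2018"))   # True
print(is_ambiguous("25/2/2018"))  # False
```
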
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  • Examples: eye color of a person, temperature, etc.
  • Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describes an object
  • Object is also known as record, point, case, sample, entity, or instance

Example (columns are Attributes, rows are Objects):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
  • Same attribute can be mapped to different attribute values
    • Example: height can be measured in feet or meters
  • Different attributes can be mapped to the same set of values
    • Example: Attribute values for ID and age are integers
    • But properties of attribute values can be different
      • ID has no limit but age has a maximum and minimum value

Measurement of Length
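• The way you measure an attribute may not match the attribute's properties.
[Figure: line segments (labelled A, B, ...) of increasing length measured on two scales; one scale assigns the values 5, 7, 8, 10, 15 and the other assigns 1, 2, 3, 4, 5.]
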
Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
  • Distinctness: = ≠
  • Order: < >
  • Addition: + -
  • Multiplication: * /
• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & addition
• Ratio attribute: all 4 properties

Types of Attributes
• There are different types of attributes
  • Nominal
    • Examples: ID numbers, eye color, zip codes
  • Ordinal
    • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
  • Interval
    • Examples: calendar dates, temperatures in Celsius or Fahrenheit.
  • Ratio
    • Examples: temperature in Kelvin, length, time, counts

Attribute Type: Description, Examples, Operations
• Nominal
  • Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  • Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  • Operations: mode, entropy, contingency correlation, χ² test
• Ordinal
  • Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
  • Examples: hardness of minerals, {good, better, best}, grades, street numbers
  • Operations: median, percentiles, rank correlation, run tests, sign tests
• Interval
  • Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
  • Examples: calendar dates, temperature in Celsius or Fahrenheit
  • Operations: mean, standard deviation, Pearson's correlation, t and F tests
• Ratio
  • Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  • Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  • Operations: geometric mean, harmonic mean, percent variation

Attribute Level: Transformation, Comments
• Nominal
  • Transformation: Any permutation of values
  • Comments: If all employee ID numbers were reassigned, would it make any difference?
• Ordinal
  • Transformation: An order preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function.
  • Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
• Interval
  • Transformation: new_value = a * old_value + b where a and b are constants
  • Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
• Ratio
  • Transformation: new_value = a * old_value
  • Comments: Length can be measured in meters or feet.

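
A small numeric check may help show what each permitted transformation preserves. In the sketch below, the Celsius-to-Fahrenheit conversion stands in for the interval case and metres-to-feet for the ratio case; the sample values are arbitrary and the function names are illustrative.

```python
def c_to_f(celsius):
    # Interval-scale transformation: new_value = a * old_value + b
    return celsius * 9 / 5 + 32

def m_to_ft(meters):
    # Ratio-scale transformation: new_value = a * old_value
    return meters * 3.28084

c1, c2, c3 = 10.0, 20.0, 40.0

# On an interval scale, ratios of raw values change under the transformation ...
value_ratio_c = c2 / c1                      # 2.0
value_ratio_f = c_to_f(c2) / c_to_f(c1)      # 68 / 50, no longer 2
# ... but ratios of differences are preserved.
diff_ratio_c = (c3 - c2) / (c2 - c1)
diff_ratio_f = (c_to_f(c3) - c_to_f(c2)) / (c_to_f(c2) - c_to_f(c1))

# On a ratio scale, ratios of raw values are preserved.
length_ratio_m = 4.0 / 2.0
length_ratio_ft = m_to_ft(4.0) / m_to_ft(2.0)

# Ordinal: any monotonic relabelling keeps the order, e.g. {1, 2, 3} vs {0.5, 1, 10}
ranks = {"good": 1, "better": 2, "best": 3}
alt = {"good": 0.5, "better": 1, "best": 10}
same_order = sorted(ranks, key=ranks.get) == sorted(alt, key=alt.get)

print(value_ratio_c, value_ratio_f)
print(diff_ratio_c, diff_ratio_f)
print(length_ratio_m, length_ratio_ft, same_order)
```

This is why "twice as hot" is not a meaningful statement for temperatures in Celsius or Fahrenheit, while "twice as long" holds in any unit of length.
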
Discrete and Continuous Attributes
• Discrete Attribute
  • Has only a finite or countably infinite set of values
  • Examples: zip codes, counts, or the set of words in a collection of documents
  • Often represented as integer variables.
  • Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
  • Has real numbers as attribute values
  • Examples: temperature, height, or weight.
  • Practically, real values can only be measured and represented using a finite number of digits.
  • Continuous attributes are typically represented as floating-point variables.

Important Characteristics of Structured Data

• Dimensionality
  • Curse of Dimensionality
• Sparsity
  • Only presence counts
• Resolution
  • Patterns depend on the scale

Types of data sets
• Record
  • Data Matrix
  • Document Data
  • Transaction Data
• Graph
  • World Wide Web
  • Molecular Structures
• Ordered
  • Spatial Data
  • Temporal Data
  • Sequential Data
  • Genetic Sequence Data

Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Transaction Data
• A special type of record data, where
  • each record (transaction) involves a set of items.
  • For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

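
One way to hold the five transactions in code is as a mapping from TID to a set of items. The sketch below counts how often items and item pairs occur, and also converts each basket to a fixed-length binary record, which shows in what sense transaction data is a special kind of record data. The variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Number of transactions that contain each item
item_counts = Counter(item for basket in transactions.values() for item in basket)

# Number of transactions that contain each (alphabetically ordered) pair of items
pair_counts = Counter(
    pair
    for basket in transactions.values()
    for pair in combinations(sorted(basket), 2)
)

# Binary record view: one 0/1 attribute per item, in alphabetical order
items = sorted(set().union(*transactions.values()))
binary_rows = {
    tid: [int(item in basket) for item in items]
    for tid, basket in transactions.items()
}

print(item_counts["Milk"])              # 4
print(pair_counts[("Diaper", "Milk")])  # 3
print(items)                            # ['Beer', 'Bread', 'Coke', 'Diaper', 'Milk']
print(binary_rows[2])                   # [1, 1, 0, 0, 0]
```
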
Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1

Document – term matrix
• Each document becomes a ‘term’ vector,
  • each term is a component (attribute) of the vector,
  • the value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0

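
A term vector can be built by tokenising each document and counting term occurrences. In the sketch below, the two documents are made up for illustration (they are not the documents behind the slide's table), and the tokeniser is a deliberately simple assumption: lowercase runs of letters.

```python
import re
from collections import Counter

# Hypothetical documents, invented for this example
docs = {
    "doc1": "The team lost the game. The coach called a timeout.",
    "doc2": "Team play wins the game; the team will score.",
}

def term_counts(text):
    """Count lowercase alphabetic tokens in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

counts = {name: term_counts(text) for name, text in docs.items()}

# Shared vocabulary defines the columns of the document-term matrix
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one column per term; absent terms count as 0
matrix = {name: [c[term] for term in vocabulary] for name, c in counts.items()}

print(vocabulary)
print(matrix["doc1"])
print(matrix["doc2"])
```

Because most terms do not occur in most documents, real document-term matrices are sparse, which is one reason sparsity is listed among the important characteristics of structured data.
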
Graph Data
• Examples: Generic graph and HTML Links
[Figure: a generic graph; the numbers 2, 5, 1, 2 and 5 appear as labels in the figure.]

<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers

Chemical Data
• Benzene Molecule: C6H6

Ordered Data
• Sequences of transactions
[Figure: a sequence of transactions; each element of the sequence is a set of items/events.]

Ordered Data
• Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Ordered Data
• Spatio-Temporal Data

[Figure: Average Monthly Temperature of land and ocean]

Major Tasks in Data Preprocessing

Note: outliers = exceptions!
• Data cleaning
  • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  • Integration of multiple databases, data cubes, or files
• Data transformation
  • Normalization and aggregation
• Data reduction
  • Obtains reduced representation in volume but produces the same or similar analytical results
• Data discretization
  • Part of data reduction but with particular importance, especially for numerical data

Forms of data preprocessing

Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
  • Noise and outliers
  • Missing values
  • Duplicate data

Data Cleaning
• Importance
  • “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
  • “Data cleaning is the number one problem in data warehousing” (DCI survey)
• Data cleaning tasks
  • Fill in missing values
  • Identify outliers and smooth out noisy data
  • Correct inconsistent data
  • Resolve redundancy caused by data integration

Missing Data
• Data is not always available
  • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  • equipment malfunction
  • data that was inconsistent with other recorded data and thus deleted
  • data not entered due to misunderstanding
  • certain data not being considered important at the time of entry
  • failure to register history or changes of the data
• Missing data may need to be inferred.

Missing Values
• Reasons for missing values
  • Information is not collected (e.g., people decline to give their age and weight)
  • Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
• Handling missing values
  • Eliminate Data Objects
  • Estimate Missing Values
  • Ignore the Missing Value During Analysis
  • Replace with all possible values (weighted by their probabilities)

How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
  • a global constant: e.g., “unknown”, a new class?!
  • the attribute mean
  • the attribute mean for all samples belonging to the same class: smarter
  • the most probable value: inference-based, such as a Bayesian formula or a decision tree

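
Three of the automatic fill-in strategies (global constant, attribute mean, class-conditional mean) can be sketched as follows. The records, the `None` marker for missing values, and the function names are all assumptions made for this illustration; the inference-based option is not shown.

```python
from statistics import mean

# Hypothetical records: (class label, income); None marks a missing income
records = [
    ("yes", 95.0), ("yes", None), ("yes", 85.0),
    ("no", 125.0), ("no", 100.0), ("no", None), ("no", 75.0),
]

def fill_constant(rows, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [(label, constant if value is None else value) for label, value in rows]

def fill_attribute_mean(rows):
    """Replace every missing value with the mean of the observed values."""
    overall = mean(value for _, value in rows if value is not None)
    return [(label, overall if value is None else value) for label, value in rows]

def fill_class_mean(rows):
    """Replace a missing value with the mean of observed values in the same class."""
    class_means = {
        label: mean(v for c, v in rows if c == label and v is not None)
        for label in {c for c, _ in rows}
    }
    return [(label, class_means[label] if value is None else value) for label, value in rows]

print(fill_constant(records)[1])        # ('yes', 'unknown')
print(fill_attribute_mean(records)[1])  # ('yes', 96.0)
print(fill_class_mean(records)[1])      # ('yes', 90.0)
print(fill_class_mean(records)[5])      # ('no', 100.0)
```

The class-conditional variant gives different fills for the two classes (90.0 and 100.0) where the plain attribute mean gives 96.0 for both, which is the sense in which the slide calls it "smarter". Note that this sketch would raise an error for a class whose values are all missing.
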
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitation
  • inconsistency in naming convention
• Other data problems that require data cleaning
  • duplicate records
  • incomplete data
  • inconsistent data

Noise
• Noise refers to modification of original values
  • Example: distortion of a person’s voice when talking on a poor phone
[Figure: two panels, "Two Sine Waves" and "Two Sine Waves + Noise".]

How to Handle Noisy Data?
• Binning method:
  • first sort data and partition into (equi-depth) bins
  • then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
• Clustering
  • detect and remove outliers
• Combined computer and human inspection
  • detect suspicious values and check by human (e.g., deal with possible outliers)
• Regression
  • smooth by fitting the data into regression functions

Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
  • Divides the range into N intervals of equal size: uniform grid
  • If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B - A)/N.
  • The most straightforward, but outliers may dominate presentation
  • Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
  • Divides the range into N intervals, each containing approximately the same number of samples
  • Good data scaling
  • Managing categorical attributes can be tricky.

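
The two partitioning schemes can be compared on the sorted price list used in the smoothing example of this lecture. The sketch below assumes that the highest value is placed in the last interval and that any leftover samples in equal-depth binning go to the earliest bins; neither convention is stated on the slide.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of width W = (B - A) / N.

    Assumes max(values) > min(values), otherwise the width would be zero.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value would land at index n_bins, so clamp it into the last bin
        index = min(int((v - lo) / width), n_bins - 1)
        bins[index].append(v)
    return bins

def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))  # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
print(equal_depth_bins(prices, 3))  # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```

With W = (34 - 4)/3 = 10 the equal-width bins hold 3, 3 and 6 values, while the equal-depth bins hold 4 values each.
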
Binning Methods for Data Smoothing
• Sorted data (e.g., by price)
  • 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
  • Bin 1: 4, 8, 9, 15
  • Bin 2: 21, 21, 24, 25
  • Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
  • Bin 1: 9, 9, 9, 9
  • Bin 2: 23, 23, 23, 23
  • Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
  • Bin 1: 4, 4, 4, 15
  • Bin 2: 21, 21, 25, 25
  • Bin 3: 26, 26, 26, 34

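
The smoothed values on this slide can be reproduced with two short functions. Rounding the bin mean to the nearest integer and sending ties to the lower boundary are assumptions; the slide's numbers are consistent with them but do not state them.

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max (ties go to the min)."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

The exact bin means are 9, 22.75 and 29.25, which is why the slide shows 23 and 29 for bins 2 and 3.
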
Cluster Analysis

Regression
[Figure: x-y plot with the regression line y = x + 1; the point at x = X1 is mapped to the fitted value Y1’ on the line.]

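
Smoothing by regression replaces each observed value with the value predicted by a fitted function. The sketch below fits a straight line by ordinary least squares; the observations are hypothetical, chosen to scatter around y = x + 1, and `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical noisy observations scattered around y = x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.8, 3.1, 3.9, 5.0]

slope, intercept = fit_line(xs, ys)

# Each observed y is replaced by the fitted value on the line
smoothed = [slope * x + intercept for x in xs]

print(slope, intercept)
print(smoothed)
```

For these five points the fitted slope and intercept come out close to 0.97 and 1.06, so the smoothed values lie near the line y = x + 1.
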
Outliers
• Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set

Duplicate Data
• Data set may include data objects that are duplicates, or almost duplicates, of one another
  • Major issue when merging data from heterogeneous sources
• Examples:
  • Same person with multiple email addresses
• Data cleaning
  • Process of dealing with duplicate data issues

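
The "same person with multiple email addresses" case can be sketched as a merge on a normalized key. The records and the matching rule below (same name after case and whitespace normalization, same phone digits) are hypothetical; real deduplication usually needs fuzzier matching than this.

```python
# Hypothetical customer records merged from two sources
records = [
    {"name": "Ann Lee", "phone": "555-0101", "email": "ann@example.com"},
    {"name": "ann  lee", "phone": "5550101", "email": "a.lee@example.org"},
    {"name": "Bob Roy", "phone": "555-0199", "email": "bob@example.com"},
]

def match_key(record):
    """Assumed matching rule: same normalized name and same phone digits."""
    name = " ".join(record["name"].lower().split())
    phone = "".join(ch for ch in record["phone"] if ch.isdigit())
    return name, phone

# Merge records that share a key, collecting every distinct email address
merged = {}
for record in records:
    entry = merged.setdefault(match_key(record), {"name": record["name"], "emails": []})
    if record["email"] not in entry["emails"]:
        entry["emails"].append(record["email"])

people = list(merged.values())
print(len(people))          # 2
print(people[0]["emails"])  # ['ann@example.com', 'a.lee@example.org']
```
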
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation

