Machine Learning Lecture 4 data types

The document provides an overview of data and its preprocessing, detailing types of attributes such as nominal, ordinal, interval, and ratio, as well as discrete and continuous attributes. It categorizes data sets into record, graph, and ordered types, and discusses data quality issues including noise, outliers, missing values, and duplicate data. The document emphasizes the importance of data quality and methods for handling various data quality problems.

Uploaded by

nimranadeem242


DATA AND PREPROCESSING

WHAT IS DATA?

• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance
[Figure labels: Attributes, Objects]

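As a minimal sketch of the terminology above, a data set can be written as a list of objects that all share the same attributes. The names and values below are invented for illustration only.

```python
# Hypothetical data set: each dict is one data object (record),
# and each key is an attribute (variable, field, feature).
people = [
    {"name": "Ann", "eye_color": "brown", "temperature_c": 36.6},
    {"name": "Bob", "eye_color": "blue", "temperature_c": 37.1},
    {"name": "Cat", "eye_color": "green", "temperature_c": 36.9},
]

# The attributes are the keys shared by every object.
attributes = sorted(people[0].keys())

# One attribute viewed across all objects (a "column" of the data set).
eye_colors = [p["eye_color"] for p in people]

print(attributes)
print(eye_colors)
```
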
TYPES OF ATTRIBUTES

• There are different types of attributes
– Nominal
  • Examples: ID numbers, eye color
– Ordinal
  • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
– Interval
  • Examples: calendar dates
– Ratio
  • Examples: temperature in Kelvin, length, time, counts

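One way to tell the four types apart is by which operations give a meaningful answer. The slides do not list these operations; the sketch below follows the standard measurement-scale convention, and the integer encoding of height is a hypothetical choice.

```python
from datetime import date

# Nominal: only equality / inequality is meaningful.
eye_a, eye_b = "brown", "blue"
same_eye_color = eye_a == eye_b

# Ordinal: order is meaningful, but differences between levels are not.
height_order = {"short": 0, "medium": 1, "tall": 2}  # hypothetical encoding
taller = height_order["tall"] > height_order["medium"]

# Interval: differences are meaningful, ratios are not
# (there is no sensible "twice" a calendar date).
days_between = (date(2024, 3, 1) - date(2024, 1, 1)).days

# Ratio: differences and ratios are both meaningful because zero is a true zero.
kelvin_ratio = 300.0 / 150.0  # 300 K is twice the temperature of 150 K

print(same_eye_color, taller, days_between, kelvin_ratio)
```
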
DISCRETE AND CONTINUOUS ATTRIBUTES

• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables
– Note: binary attributes are a special case of discrete attributes

• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– Practically, real values can only be measured and represented using a finite number of digits
– Continuous attributes are typically represented as floating-point variables

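The point about finite digits can be seen directly in code. This is a small sketch using only the Python standard library, with a made-up sentence: word counts are discrete integers, while a floating-point sum is only an approximation of the real number it stands for.

```python
import math
from collections import Counter

# Discrete: word counts take values in {0, 1, 2, ...}.
words = "data mining finds patterns in data".split()
counts = Counter(words)

# Continuous: a real value is stored with a finite number of binary digits,
# so exact equality can fail even for simple decimal arithmetic.
total = 0.1 + 0.2
exactly_equal = total == 0.3
close_enough = math.isclose(total, 0.3)

print(counts["data"], exactly_equal, close_enough)
```

For this reason continuous attributes are usually compared with a tolerance rather than with `==`.
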
TYPES OF DATA SETS

• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Temporal Data
– Sequential Data
– Genetic Sequence Data

RECORD DATA

• Data that consists of a collection of records, each of which consists of a fixed set of attributes

DATA MATRIX

• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

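A minimal sketch of the m by n view, using plain Python lists and made-up measurements. Treating rows as points means a distance between two objects is defined. Euclidean distance is used here as one common choice, not one prescribed by the slides.

```python
# Hypothetical data: m = 3 objects, n = 2 numeric attributes
# (height in cm, weight in kg).
matrix = [
    [170.0, 65.0],
    [180.0, 80.0],
    [160.0, 55.0],
]

m = len(matrix)      # number of rows, one per object
n = len(matrix[0])   # number of columns, one per attribute

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

# Distance between the first and second object.
d01 = euclidean(matrix[0], matrix[1])
print(m, n, d01)
```
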
DOCUMENT DATA

• Each document becomes a "term" vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times the corresponding term occurs in the document

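The construction above can be sketched in a few lines. The two documents are invented, and splitting on whitespace is a simplifying assumption; real pipelines normally tokenize more carefully.

```python
from collections import Counter

docs = ["the team won the game", "the coach lost the ball game"]  # made-up documents

# Vocabulary: every distinct term, in a fixed (sorted) order.
vocab = sorted({term for doc in docs for term in doc.split()})

def term_vector(doc):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

vectors = [term_vector(doc) for doc in docs]
print(vocab)
print(vectors)
```

Stacking the vectors gives a data matrix with one row per document and one column per term.
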
TRANSACTION DATA

• A special type of record data, where
– Each record (transaction) involves a set of items
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
[Figure labels: item, transaction]

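A small sketch of the grocery example, assuming each transaction is stored as a set of item names keyed by a transaction ID. The items are invented, and `support_count` is a hypothetical helper name for counting how many transactions contain an item.

```python
# Hypothetical grocery transactions: each set holds the items of one shopping trip.
transactions = {
    1: {"bread", "coke", "milk"},
    2: {"beer", "bread"},
    3: {"beer", "coke", "diaper", "milk"},
}

def support_count(item):
    """Number of transactions that contain the given item."""
    return sum(1 for items in transactions.values() if item in items)

print(support_count("bread"), support_count("milk"))
```
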
GRAPH DATA

• Examples: Generic graph and HTML Links

<a href="papers/papers.html#bbbb"> Data Mining </a>
<li> <a href="papers/papers.html#aaaa"> Graph Partitioning </a>
<li> <a href="papers/papers.html#aaaa"> Parallel Solution of Sparse Linear System of Equations </a>
<li> <a href="papers/papers.html#ffff"> N-Body Computation and Dense Linear System Solvers </a>

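To show how HTML links become graph data, the sketch below extracts link targets with the standard-library `html.parser` module and stores them as an adjacency list. The page name `index.html` and the shortened link text are assumptions made for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href target of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.targets.append(href)

html = """
<a href="papers/papers.html#bbbb"> Data Mining </a>
<li> <a href="papers/papers.html#aaaa"> Graph Partitioning </a>
<li> <a href="papers/papers.html#ffff"> N-Body Computation </a>
"""

collector = LinkCollector()
collector.feed(html)

# Adjacency list: the page (hypothetical name) has an edge to each link target.
graph = {"index.html": collector.targets}
print(graph)
```
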
CHEMICAL DATA

Benzene Molecule: C6H6

ORDERED DATA

• Sequences of transactions
[Figure labels: Items/Events, An element of the sequence]

ORDERED DATA

• Genomic sequence data

ORDERED DATA

• Spatio-Temporal Data
[Figure captions: Average Monthly Temperature of land and ocean; Trajectories of Moving Objects]

• Spatial Data: refers to the location-related aspects of data
– Applications: healthcare, environmental studies, geography, land
• Temporal Data: refers to the time-related aspects of data, e.g. hours, days, years
– Applications: weather forecasting, e-commerce, education

DATA QUALITY

• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
– Noise and outliers
– Missing values
– Duplicate data

NOISE

• Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone and “snow” on a television screen
[Figure panels: Two Sine Waves; Two Sine Waves + Noise]

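The two figure panels can be approximated numerically. This is only a sketch: the frequencies, the number of samples, and the additive Gaussian noise model with standard deviation 0.5 are all assumptions, since the slide does not say how its figure was generated.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

n = 200
t = [i / n for i in range(n)]

# Two sine waves with assumed frequencies of 3 and 7 cycles over the interval.
clean = [math.sin(2 * math.pi * 3 * x) + math.sin(2 * math.pi * 7 * x) for x in t]

# Noise modeled as an additive Gaussian perturbation of each original value.
noisy = [y + random.gauss(0.0, 0.5) for y in clean]

# Root-mean-square difference between the original and the modified values.
rms = math.sqrt(sum((a - b) ** 2 for a, b in zip(clean, noisy)) / n)
print(round(rms, 2))
```

The RMS difference comes out close to the chosen noise level, which is one simple way to quantify how far the recorded values have moved from the originals.
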
OUTLIERS

• Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set

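The slides do not name a detection method. One common heuristic, sketched here on made-up measurements, flags values that lie far from the median relative to the median absolute deviation (MAD). The cutoff of 5 is a hypothetical choice, not a standard.

```python
import statistics

values = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]  # made-up measurements

median = statistics.median(values)

# Median absolute deviation: a spread estimate that the outlier itself barely affects.
mad = statistics.median(abs(v - median) for v in values)

threshold = 5.0  # hypothetical cutoff, in units of MAD
outliers = [v for v in values if abs(v - median) > threshold * mad]
print(outliers)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value can inflate the latter enough to hide itself.
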
DEVIATION/ANOMALY DETECTION

• Outliers are useful when we need to detect significant deviations from normal behavior
• Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection

MISSING VALUES

• Reasons for missing values
– Information is not collected (e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)

• Handling missing values
– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by their probabilities)

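The first three handling strategies can be sketched on a made-up list of ages, with `None` standing for a missing value. Mean imputation is used as one simple estimator; the slides do not prescribe a particular one, and the fourth strategy (weighting all possible values by probability) is not shown.

```python
import statistics

ages = [25, None, 31, 40, None, 28]  # None marks a missing value (made-up data)

# 1. Eliminate data objects that have a missing value.
complete = [a for a in ages if a is not None]

# 2. Estimate missing values, here with the mean of the observed ones.
mean_age = statistics.mean(complete)
imputed = [a if a is not None else mean_age for a in ages]

# 3. Ignore missing values during analysis: compute over observed values only.
observed_max = max(a for a in ages if a is not None)

print(complete, mean_age, observed_max)
print(imputed)
```

Elimination shrinks the data set, while imputation keeps every object at the cost of introducing estimated values, so the choice depends on how much data is missing and why.
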
DUPLICATE DATA

• Data set may include data objects that are duplicates, or almost duplicates, of one another
– Major issue when merging data from heterogeneous sources

• Examples:
– Same person with multiple email addresses

• Data cleaning
– Process of dealing with duplicate data issues
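
A minimal sketch of the email example, assuming near-duplicates differ only in letter case and spacing of the name. The records and the `normalize` helper are hypothetical. Matching on name alone can wrongly merge different people, so real data cleaning usually compares several attributes.

```python
# Made-up records merged from two sources: the same person appears
# with different email addresses and inconsistent name formatting.
records = [
    {"name": "Jane Doe", "email": "jane@example.com"},
    {"name": "jane  doe ", "email": "j.doe@example.org"},
    {"name": "John Roe", "email": "john@example.com"},
]

def normalize(name):
    """Lowercase and collapse whitespace so near-identical names compare equal."""
    return " ".join(name.lower().split())

# Group email addresses under the normalized name.
merged = {}
for rec in records:
    key = normalize(rec["name"])
    merged.setdefault(key, set()).add(rec["email"])

print(merged)
```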
