Machine Learning Lecture 4 data types

The document provides an overview of data and its preprocessing, detailing types of attributes such as nominal, ordinal, interval, and ratio, as well as discrete and continuous attributes. It categorizes data sets into record, graph, and ordered types, and discusses data quality issues including noise, outliers, missing values, and duplicate data. The document emphasizes the importance of data quality and methods for handling various data quality problems.

Uploaded by

nimranadeem242
Copyright: © All Rights Reserved

DATA AND PREPROCESSING

WHAT IS DATA?

 Collection of data objects and their attributes

 An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature
 A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

TYPES OF ATTRIBUTES

 There are different types of attributes
– Nominal
 Examples: ID numbers, eye color
– Ordinal
 Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
– Interval
 Examples: calendar dates
– Ratio
 Examples: temperature in Kelvin, length, time, counts

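The four attribute types differ in which comparisons are meaningful. The sketch below assumes the usual reading (nominal: equality only; ordinal: also order; interval: also differences; ratio: also ratios). The sample values and the `Height` enum are illustrative and do not come from the lecture.

```python
from datetime import date
from enum import IntEnum

# All values below are made-up examples, not data from the lecture.

# Nominal: only equality/inequality is meaningful.
eye_a, eye_b = "brown", "blue"
same_eye_color = eye_a == eye_b

# Ordinal: values can be ranked, but the gaps between ranks are not measured.
class Height(IntEnum):
    SHORT = 1
    MEDIUM = 2
    TALL = 3

taller = Height.TALL > Height.MEDIUM

# Interval: differences are meaningful, ratios are not (no true zero).
days_between = (date(2024, 3, 1) - date(2024, 1, 1)).days

# Ratio: a true zero exists, so ratios are meaningful.
kelvin_ratio = 300.0 / 150.0

print(same_eye_color, taller, days_between, kelvin_ratio)
```

Dividing one calendar date by another has no meaning, which is why dates stop at the interval level, while 300 K can sensibly be called twice 150 K.
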
DISCRETE AND CONTINUOUS ATTRIBUTES

 Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes

 Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-point variables.

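The point about finite digits can be seen directly in code. This is a small sketch with made-up values: a discrete attribute stored as an integer, and a continuous one stored as a float whose limited precision makes exact comparison unreliable.

```python
import math

# Illustrative values only (not from the lecture).
zip_code = 54000            # discrete attribute: stored as an integer
measured = 0.1 + 0.2        # continuous attribute: stored as a float

# Floats carry a finite number of binary digits, so the sum is not exactly 0.3.
exact_match = measured == 0.3
# Comparing within a tolerance is the usual workaround.
close_enough = math.isclose(measured, 0.3)

print(exact_match, close_enough)
```
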
TYPES OF DATA SETS
 Record
– Data Matrix
– Document Data
– Transaction Data
 Graph
– World Wide Web
– Molecular Structures
 Ordered
– Temporal Data
– Sequential Data
– Genetic Sequence Data

RECORD DATA

 Data that consists of a collection of records, each of which consists of a fixed set of attributes

DATA MATRIX

 If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute

 Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

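As a rough illustration of the m by n layout, the sketch below stores a data matrix as a list of rows. The attribute names and numbers are hypothetical.

```python
# Hypothetical data: 3 objects (rows) described by 2 numeric attributes (columns).
attribute_names = ["height_cm", "weight_kg"]
data_matrix = [
    [170.0, 65.0],
    [182.5, 80.2],
    [158.3, 52.1],
]

m = len(data_matrix)       # number of objects (rows)
n = len(data_matrix[0])    # number of attributes (columns)

# Each row is a point in n-dimensional space; each column is one attribute.
weights = [row[1] for row in data_matrix]

print(m, n, weights)
```
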
DOCUMENT DATA

 Each document becomes a "term" vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding term occurs in the document.

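One possible way to build such term vectors is sketched below, assuming whitespace tokenization and a vocabulary taken from the documents themselves. The two toy documents are invented for illustration.

```python
from collections import Counter

# Hypothetical toy documents.
documents = [
    "data mining finds patterns in data",
    "graph data and record data",
]

# A fixed, sorted vocabulary so every document maps to a vector of equal length.
vocabulary = sorted({term for doc in documents for term in doc.split()})

def term_vector(doc):
    """Return the count of each vocabulary term in the document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

vectors = [term_vector(doc) for doc in documents]

print(vocabulary)
print(vectors)
```

Stacking the vectors gives a data matrix with one row per document and one column per term.
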
TRANSACTION DATA

 A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

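A minimal sketch of the grocery example, with each transaction held as a set of items. The shopping trips are hypothetical.

```python
from collections import Counter

# Hypothetical shopping trips; each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "eggs"},
    {"milk", "diapers", "bread"},
]

# Count how many transactions contain each item.
item_counts = Counter(item for t in transactions for item in t)

print(item_counts["bread"], item_counts["milk"])
```

Unlike a data matrix, the transactions have different lengths, which is why a set per record is a natural fit here.
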
GRAPH DATA

 Examples: Generic graph and HTML Links

<a href="papers/papers.html#bbbb"> Data Mining </a>
<li> <a href="papers/papers.html#aaaa"> Graph Partitioning </a>
<li> <a href="papers/papers.html#aaaa"> Parallel Solution of Sparse Linear System of Equations </a>
<li> <a href="papers/papers.html#ffff"> N-Body Computation and Dense Linear System Solvers </a>

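The HTML links can be read as edges of a graph: the page is a node and each `href` points to another node. The sketch below extracts the link targets with Python's standard `html.parser` module. The node name `index.html` is a hypothetical label for the page that holds the links, and the link text is abbreviated.

```python
from html.parser import HTMLParser

# Links from the slide; "index.html" below is an assumed name for the source page.
HTML = """
<a href="papers/papers.html#bbbb"> Data Mining </a>
<li> <a href="papers/papers.html#aaaa"> Graph Partitioning </a>
<li> <a href="papers/papers.html#aaaa"> Parallel Solution </a>
<li> <a href="papers/papers.html#ffff"> N-Body Computation </a>
"""

class LinkCollector(HTMLParser):
    """Collect the href target of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.targets.append(href)

collector = LinkCollector()
collector.feed(HTML)

# Adjacency list: node -> list of nodes it links to.
graph = {"index.html": collector.targets}

print(graph)
```
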
CHEMICAL DATA

Benzene Molecule: C6H6

ORDERED DATA

 Sequences of transactions
– Figure labels: Items/Events; An element of the sequence

ORDERED DATA

 Genomic sequence data

ORDERED DATA

 Spatio-Temporal Data
– Average Monthly Temperature of land and ocean
– Trajectories of Moving Objects

Spatial Data: Refers to the location-related aspects of data

Applications: Healthcare, environmental studies, geography, land

Temporal Data: Refers to the time-related aspects of data, e.g., hours, days, years

Applications: Weather Forecasting, E-Commerce, Education

DATA QUALITY

 What kinds of data quality problems?
 How can we detect problems with the data?
 What can we do about these problems?

 Examples of data quality problems:
– Noise and outliers
– Missing values
– Duplicate data

NOISE

 Noise refers to modification of original values
– Examples: distortion of a person's voice when talking on a poor phone connection and "snow" on a television screen
– Figure panels: Two Sine Waves; Two Sine Waves + Noise

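The two-sine-waves illustration can be approximated in a few lines. This is only a sketch: the frequencies (1 Hz and 3 Hz), the sampling step, and the Gaussian noise with standard deviation 0.5 are all assumptions, since the lecture gives no numbers.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def two_sines(t):
    # Assumed frequencies of 1 Hz and 3 Hz; the lecture does not specify them.
    return math.sin(2 * math.pi * 1 * t) + math.sin(2 * math.pi * 3 * t)

times = [i / 100 for i in range(100)]
clean = [two_sines(t) for t in times]

# Noise modifies the original values; here it is additive Gaussian noise.
noisy = [x + random.gauss(0.0, 0.5) for x in clean]

# Largest change that the noise made to any single value.
max_change = max(abs(a - b) for a, b in zip(clean, noisy))
print(max_change > 0)
```
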
OUTLIERS

 Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set

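The lecture does not name a detection method. One common choice, used here purely as an illustration, is a z-score rule: flag values that lie more than a chosen number of standard deviations from the mean. The data and the threshold of 2.0 are assumptions.

```python
import statistics

# Hypothetical measurements; 50 is far from the rest.
values = [10, 11, 9, 10, 12, 11, 10, 9, 11, 50]

mean = statistics.mean(values)
std = statistics.pstdev(values)   # population standard deviation

THRESHOLD = 2.0  # assumed cut-off; a stricter or looser value may suit other data

# Flag values whose distance from the mean exceeds THRESHOLD standard deviations.
outliers = [v for v in values if abs(v - mean) / std > THRESHOLD]

print(outliers)
```

Because the outlier itself inflates the mean and standard deviation, this rule can miss outliers in small samples; median-based rules are a frequent alternative.
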
DEVIATION/ANOMALY DETECTION

 Outliers are useful when we need to detect significant deviations from normal behavior
 Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection

MISSING VALUES

 Reasons for missing values
– Information is not collected (e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)

 Handling missing values
– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by their probabilities)

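The first three handling strategies above can be sketched as follows. The records are hypothetical, `None` marks a missing value, and mean imputation is just one assumed way to estimate a missing value. The fourth strategy (weighting all possible values by their probabilities) is not shown.

```python
import statistics

# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "weight": 70.0},
    {"age": None, "weight": 82.5},
    {"age": 31, "weight": None},
    {"age": 40, "weight": 65.0},
]

# 1. Eliminate data objects that have any missing value.
complete = [r for r in records if None not in r.values()]

# 2. Estimate missing values (assumed method: replace with the attribute mean).
def impute_mean(rows, key):
    known = [r[key] for r in rows if r[key] is not None]
    fill = statistics.mean(known)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]

imputed = impute_mean(impute_mean(records, "age"), "weight")

# 3. Ignore the missing value during analysis (use only the known ages).
mean_age = statistics.mean(r["age"] for r in records if r["age"] is not None)

print(len(complete), imputed[1]["age"], mean_age)
```
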
DUPLICATE DATA

 Data set may include data objects that are duplicates, or almost duplicates, of one another
– Major issue when merging data from heterogeneous sources

 Examples:
– Same person with multiple email addresses

 Data cleaning
– Process of dealing with duplicate data issues

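A small sketch of the email example: records from different sources are grouped under one person when their names match after normalization. The records and the matching rule (lowercase, collapsed whitespace) are assumptions; real data cleaning usually needs fuzzier matching.

```python
# Hypothetical records merged from two sources.
records = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "ann  lee", "email": "a.lee@example.org"},
    {"name": "Bob Roy", "email": "bob@example.com"},
]

def normalize(name):
    # Assumed matching rule: same name after lowercasing and collapsing spaces.
    return " ".join(name.lower().split())

# Group email addresses under one entry per normalized name.
merged = {}
for rec in records:
    person = merged.setdefault(normalize(rec["name"]), {"emails": []})
    person["emails"].append(rec["email"])

print(merged)
```
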
