CH2 Data

Uploaded by

Hunzila Nisar
Copyright
© All Rights Reserved

ARIN2137

KNOWLEDGE DISCOVERY AND DATA MINING

TOPIC 2 :

Outline
1. Definition of Dataset, Attribute
2. Types of Attribute
3. Types of Dataset
4. Issues related to Dataset

DATA SOURCE

Identify data source → Perform data pre-processing → Perform data mining → Pattern evaluation → Applying knowledge

Why we need data: to provide the necessary input, measure performance, assist in formulating solutions and satisfy our curiosity.

Where does data come from? Primary and secondary sources:
• Transactional database ~ IMTIAZ, Hospital
• Data warehouse
• Survey and observational (behavioral, attitudes, opinions)
• Experimental (scientific data, e.g. DNA)
• Server log
• Published sources of data (e.g. ABI/INFORM for business data, UCI Machine Learning Repository for KDD)
• Data can be in fixed row-and-column format, image, audio or video!

DATA TYPE (GENERAL)
• Structured
– Well-defined fields - most business datasets
• Semi-structured
– Electronic images of business documents, medical reports, executive summaries, repair manuals
• Unstructured
– Video recorded by a surveillance camera

Tools for these data?

dataset

• Data set: a collection of data records

• Other names for a record: data object, point, event, case, sample, observation, entity

• Described by "attributes" ~ capture the basic characteristics of the records in the dataset

dataset
• A data set is a file, which consists of records (or objects, patterns, cases, samples) in rows and attributes (or fields, dimensions, variables) in columns

File Name: Student.xls

Roll No.    Name            Year    CGPA
64752       Anas Kareem     2       3.6
67984       Hajra Shahid    3       3.4
74571       …               ..      ..

The whole table is the data set; each row is a record; each column is an attribute.
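
The row-and-column layout above can be sketched in a few lines of Python. This is only an illustration: the CSV text is a hypothetical export of the two complete rows of Student.xls, and the names STUDENT_CSV and load_records are made up for this example.

```python
import csv
import io

# Hypothetical CSV export of the two complete rows shown in Student.xls.
STUDENT_CSV = """Roll No.,Name,Year,CGPA
64752,Anas Kareem,2,3.6
67984,Hajra Shahid,3,3.4
"""


def load_records(text):
    """Parse CSV text into a list of records, one dict per row."""
    reader = csv.DictReader(io.StringIO(text))
    records = []
    for row in reader:
        # Convert each attribute value from text to a suitable type.
        records.append({
            "Roll No.": int(row["Roll No."]),
            "Name": row["Name"],
            "Year": int(row["Year"]),
            "CGPA": float(row["CGPA"]),
        })
    return records


records = load_records(STUDENT_CSV)        # rows = records
attributes = list(records[0].keys())       # column names = attributes

print(len(records), "records with attributes", attributes)
print([r["CGPA"] for r in records])        # one attribute across all records
```

Reading down a column gives the values of one attribute; reading across a row gives one record.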
DATAset : attribute
• An attribute is a property or characteristic of a record that may vary, either from one object to another or from one time to another.

What attributes can describe this aeroplane? Grasshopper?
DATAset : attribute & record

[Figure: a data set with its records and attributes labelled]
DATAset : types of attribute

Discrete [Nominal and Ordinal]
• Has only a finite or countably infinite set of values
• Often represented as integer variables
• Binary attributes are a special case of discrete attributes
• Eg: zip codes, counts, or the set of words in a collection of documents

Continuous [Interval and Ratio]
• Has real numbers as attribute values
• Practically, real values can only be measured and represented using a finite number of digits
• Typically represented as floating-point variables
• Eg: temperature, height, weight
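
As a rough illustration of the discrete/continuous split, the sketch below guesses the kind of an attribute from how its values are stored. The rule (floats are continuous, everything else is discrete) is a simplifying assumption for this example, not a general test, and the column names and values are hypothetical.

```python
def attribute_kind(values):
    """Rough guess: 'continuous' if any value is a float, else 'discrete'."""
    # Integers, booleans and strings are all treated as discrete here.
    if any(isinstance(v, float) for v in values):
        return "continuous"
    return "discrete"


# Hypothetical columns, one list of values per attribute.
columns = {
    "zip_code": ["54000", "75500", "54000"],   # labels, not quantities
    "visits": [3, 0, 12],                      # counts
    "is_member": [True, False, True],          # binary attribute
    "height_cm": [171.5, 160.0, 182.3],        # measured real values
}

kinds = {name: attribute_kind(vals) for name, vals in columns.items()}
print(kinds)
```

In practice the storage type is only a hint: a zip code stored as an integer is still a label, so the final decision belongs to whoever knows what the attribute means.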
DATAset : types of attribute

[Figure: types of attribute; only the label "NUMERIC" is recoverable]
types of dataset

• Record data
• Graph-based data
• Ordered data

Record data
• A collection of records, each of which consists of a fixed set of attributes
• Stored in a flat file or a relational database
• Types: Market-Basket Data (Transaction data), Data Matrix, Sparse Data Matrix
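
The three record-data types listed above are closely related, as the following minimal sketch suggests. The transactions and item names are hypothetical, and the dictionary-of-coordinates format is just one simple way to hold a sparse matrix.

```python
# Market-basket (transaction) data: each record is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper"},
]

# Data matrix: one row per transaction, one column per item (1 = bought).
items = sorted(set().union(*transactions))
matrix = [[1 if item in t else 0 for item in items] for t in transactions]

# Sparse data matrix: store only the non-zero cells as (row, column) -> value.
sparse = {
    (r, c): value
    for r, row in enumerate(matrix)
    for c, value in enumerate(row)
    if value != 0
}

print(items)
for row in matrix:
    print(row)
print(len(sparse), "non-zero cells out of", len(matrix) * len(items))
```

When most cells are zero, as is usual with many items and small baskets, the sparse form saves space because the zeros are never stored.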

Graph-based data
• Data is represented in the form of a graph ~ e.g. relationships between objects, links between pages of a website
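
A small sketch of graph-based data, assuming a hypothetical website whose pages link to one another. The page names and the helper functions build_graph and in_degree are invented for this illustration.

```python
def build_graph(edges):
    """Build an adjacency list: page -> list of pages it links to."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])   # make sure every page appears as a node
    return graph


def in_degree(graph):
    """Count how many links point at each page."""
    counts = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return counts


# Hypothetical links between pages of a website: (from page, to page).
links = [
    ("home", "about"),
    ("home", "products"),
    ("products", "home"),
    ("about", "contact"),
]

graph = build_graph(links)
print(graph)
print(in_degree(graph))
```

Here the objects (pages) are the nodes and the relationships (links) are the edges, so the information lives in the connections rather than in a fixed set of columns.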

Ordered data
• The attributes have relationships that involve order in time or space.
• Examples:
– Sequential data / temporal data – each record has a time associated with it.
– Sequence data – a data set that is a sequence of individual entities (e.g. a sequence of words or letters). No time stamps.
– Time series data – a special type of sequential data in which each record is a time series (a series of measurements taken over time).
– Spatial data – such as positions or areas.
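
The difference between the first three kinds of ordered data can be sketched as follows. All values here (purchases, time stamps, the letter sequence, the temperatures) are hypothetical and chosen only to show the structure.

```python
from datetime import datetime

# Sequential (temporal) data: every record carries a time stamp.
purchases = [
    ("2024-01-03 10:15", "milk"),
    ("2024-01-01 09:00", "bread"),
    ("2024-01-02 18:30", "beer"),
]
ordered = sorted(
    purchases,
    key=lambda p: datetime.strptime(p[0], "%Y-%m-%d %H:%M"),
)
items_in_time_order = [item for _, item in ordered]

# Sequence data: the order of the entities matters, but there are no time stamps.
sequence = "GATTACA"
adjacent_pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]

# Time series data: one series of measurements taken over time.
temperatures = [21.0, 22.5, 22.0, 24.0]
changes = [later - earlier for earlier, later in zip(temperatures, temperatures[1:])]

print(items_in_time_order)
print(adjacent_pairs)
print(changes)
```

In each case, shuffling the values would destroy information, which is what separates ordered data from plain record data.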

Data in reality!
• There is a lot of data. However, it is far from perfect!

• Data in the real world is dirty and of low quality

– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• e.g., occupation=“”
– noisy: containing errors or outliers
• e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
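
The three kinds of dirty data above can be detected with simple checks, as in this minimal sketch. The function find_issues, the field names and the reference date are assumptions made for the example, and the birthday 03/07/1997 is read here as 3 July 1997, since the slide does not say which date format is meant.

```python
from datetime import date


def find_issues(record, today):
    """Return a list of data-quality problems found in one record."""
    issues = []

    # Incomplete: a required attribute value is missing or empty.
    if not record.get("occupation"):
        issues.append("incomplete: occupation is missing")

    # Noisy: a value that cannot be right, such as a negative salary.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        issues.append("noisy: salary is negative")

    # Inconsistent: the stated age disagrees with the birthday.
    age = record.get("age")
    birthday = record.get("birthday")
    if age is not None and birthday is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        expected = today.year - birthday.year - (0 if had_birthday else 1)
        if expected != age:
            issues.append(
                f"inconsistent: age {age} but birthday implies {expected}"
            )
    return issues


# The slide's examples combined into one record (birthday assumed to be 3 July 1997).
dirty = {"occupation": "", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
issues = find_issues(dirty, today=date(2024, 1, 1))   # assumed reference date
for issue in issues:
    print(issue)
```

Checks like these only flag problems; deciding whether to correct, fill in or drop a flagged value is a separate preprocessing decision.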
To ‘clean’ dirty data, we perform data preprocessing:

Input data → Data Preprocessing → Data Mining → Post Processing → Information

UCI Machine Learning Repository

http://archive.ics.uci.edu/ml/datasets.html
