
CS F415 L2 - Data

The document describes data and various data types. It defines data as a collection of objects and their attributes. It lists different types of attributes like nominal, ordinal, interval, and ratio attributes. It also describes different types of data sets like records, documents, transactions, graphs, and ordered data. Finally, it discusses important characteristics of structured data and various data quality issues like noise, missing values, and duplicates.

Uploaded by

f20220226

BITS Pilani, Hyderabad Campus
Prof. Aruna Malapati
Department of CSIS

Data
Today’s Learning Objectives

• Describe data

• List various data types

• List the issues in data quality

• List and identify the right preprocessing techniques for the given data



What is Data?
• Collection of data objects and their attributes

• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Other names: variable, field, characteristic, feature, predictor, etc.

• A collection of attributes describes an object
– Other names: record, point, case, sample, entity, or instance

In the table below, each column is an attribute and each row is an object:

Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes



Attribute Values



Types of Attributes

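• Nominal
– Example: hair color

• Ordinal
– Example: car prices (low, medium, high)

• Discrete
– Has a finite or countably infinite set of values
– Example: terms in a document

• Continuous
– Has real numbers as values
– Examples: length, weight, temperature, etc.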



Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
– Distinctness: = ≠
– Order: < >
– Addition: + -
– Multiplication: * /

– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties
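The hierarchy above can be sketched as a small lookup table in which each attribute type carries the properties of the weaker types plus one more. This is only an illustration; the names `PROPERTIES` and `supports` are made up for this sketch and do not come from the lecture.

```python
# Properties each attribute type possesses, copied from the list above.
PROPERTIES = {
    "nominal": {"distinctness"},
    "ordinal": {"distinctness", "order"},
    "interval": {"distinctness", "order", "addition"},
    "ratio": {"distinctness", "order", "addition", "multiplication"},
}


def supports(attribute_type, prop):
    """Return True if values of this attribute type have the given property."""
    return prop in PROPERTIES[attribute_type]


# Eye color (nominal) can be tested for equality but cannot be ordered.
print(supports("nominal", "distinctness"))      # True
print(supports("nominal", "order"))             # False

# Temperature in Celsius (interval): differences make sense, ratios do not.
print(supports("interval", "addition"))         # True
print(supports("interval", "multiplication"))   # False
```

A check like this could guard an analysis step, for example refusing to compute a ratio on an interval attribute.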



Attribute Type | Description | Examples | Operations
Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation
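As a rough illustration of the Operations column, the sketch below applies one statistic per attribute type using Python's standard `statistics` module. The sample values and variable names are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import statistics

# Hypothetical sample values, one list per attribute type.
eye_color = ["brown", "blue", "brown", "green"]            # nominal
quality = ["good", "better", "best", "good", "better"]     # ordinal
temp_celsius = [20.0, 22.0, 24.0]                          # interval
mass_kg = [1.0, 4.0, 16.0]                                 # ratio

# Nominal: only equality is meaningful, so the mode is a valid summary.
mode_color = statistics.mode(eye_color)

# Ordinal: map labels to ranks, then take the median rank.
rank = {"good": 0, "better": 1, "best": 2}
labels = sorted(rank, key=rank.get)
median_rank = statistics.median_low([rank[q] for q in quality])
median_quality = labels[median_rank]

# Interval: differences are meaningful, so the mean is a valid summary.
mean_temp = statistics.mean(temp_celsius)

# Ratio: ratios are meaningful, so the geometric mean is a valid summary.
geo_mean_mass = statistics.geometric_mean(mass_kg)

print(mode_color, median_quality, mean_temp, geo_mean_mass)
```

Each statistic also remains valid for the stronger types further down the table: a mode can be taken of ratio data, but a geometric mean of nominal data has no meaning.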
Discrete and Continuous Attributes



Types of data sets

• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data



Important Characteristics of Structured Data

– Dimensionality

• Curse of Dimensionality

– Sparsity

• Only presence counts

– Resolution

• Patterns depend on the scale



Record Data

• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes
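One way to hold such record data in code is a list of objects that all share the same fields. The sketch below loads the ten rows of the table into a hypothetical `TaxRecord` dataclass; the class name, field names, and the choice to store income in thousands are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class TaxRecord:
    tid: int
    refund: bool
    marital_status: str
    taxable_income: int  # in thousands, so 125 stands for 125K
    cheat: bool


# Rows copied from the table above.
rows = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

records = [
    TaxRecord(tid, refund == "Yes", status, income, cheat == "Yes")
    for tid, refund, status, income, cheat in rows
]

# Every record has the same fixed set of attributes, so queries are uniform.
cheaters = [r.tid for r in records if r.cheat]
mean_income = sum(r.taxable_income for r in records) / len(records)

print(cheaters)      # [5, 8, 10]
print(mean_income)   # 104.0
```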



Data Matrix

Projection of x Load | Projection of y Load | Distance | Load | Thickness
10.23 | 5.27 | 15.22 | 2.7 | 1.2
12.65 | 6.25 | 16.22 | 2.2 | 1.1
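When every attribute is numeric, the records can be stored as a matrix with one row per object and one column per attribute. The following is a minimal sketch using plain Python lists; the snake_case column names are an assumption derived from the table headers.

```python
# Column names derived from the table headers (naming is an assumption).
columns = ["projection_x_load", "projection_y_load", "distance", "load", "thickness"]

# One row per object, one column per attribute, values copied from the table.
matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

n_objects = len(matrix)
n_attributes = len(matrix[0])

# zip(*matrix) transposes the rows into columns, giving one tuple per attribute.
column_means = {
    name: sum(values) / len(values)
    for name, values in zip(columns, zip(*matrix))
}

print(n_objects, n_attributes)
print(column_means["distance"])
```

A numeric array library would normally replace the nested lists for larger data; plain lists are used here only to keep the sketch self-contained.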



Document Data

Document | team | coach | play | ball | score | game | win | lost | timeout | season
Document 1 | 3 | 0 | 5 | 0 | 2 | 6 | 0 | 2 | 0 | 2
Document 2 | 0 | 7 | 0 | 2 | 1 | 0 | 0 | 3 | 0 | 0
Document 3 | 0 | 1 | 0 | 0 | 1 | 2 | 2 | 0 | 3 | 0
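Each row of the table is a term vector: the count of every vocabulary term in one document. The sketch below shows how such a vector could be built from raw text, using the ten terms that appear in the table; the example sentence and the function name `term_vector` are hypothetical.

```python
from collections import Counter

# The ten terms that appear as columns in the table.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]


def term_vector(text):
    """Count how often each vocabulary term occurs in a whitespace-split text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical document text, not taken from the lecture.
doc = "team play game play team score game game"
vec = term_vector(doc)
print(vec)  # [2, 0, 2, 0, 1, 3, 0, 0, 0, 0]

# Most entries are zero, which is the sparsity typical of document data.
nonzero = sum(1 for count in vec if count > 0)
print(nonzero, "of", len(vec), "terms present")
```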



Transaction Data

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
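Unlike record data, each transaction holds a variable-size set of items, so a mapping from TID to a set is a natural representation. The sketch below stores the five transactions from the table and runs two simple queries; the variable names are illustrative.

```python
from collections import Counter

# Transactions copied from the table: TID -> set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# How many transactions contain each item.
item_counts = Counter(item for items in transactions.values() for item in items)
print(item_counts["Milk"])   # 4

# Transactions that contain both Diaper and Milk (subset test on sets).
both = [tid for tid, items in transactions.items() if {"Diaper", "Milk"} <= items]
print(both)                  # [3, 4, 5]
```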



Graph Data
Examples: Generic graph and HTML Links
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
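HTML links form a graph in which pages are nodes and each `href` is a directed edge. As a sketch, the code below extracts the link targets from the fragment above with the standard `html.parser` module and stores them as an adjacency list. The source page name `index.html` and the class name `LinkCollector` are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href target of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# The HTML fragment from the slide.
html = """
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
"""

collector = LinkCollector()
collector.feed(html)

# Adjacency list: source page -> sorted list of distinct link targets.
page = "index.html"  # hypothetical name for the page holding the fragment
graph = {page: sorted(set(collector.links))}
print(graph)
```

Two of the four links point to the same target, so the page has three outgoing edges once duplicates are merged.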



Chemical Data

Benzene Molecule: C6H6



Ordered Data
Sequences of transactions



Ordered Data
Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG



Ordered Data

Spatio-Temporal Data

Average monthly temperature of land and ocean



Data Quality

• What kinds of data quality problems occur?

• How can we detect problems with the data?

• What can we do about these problems?

• Examples of data quality problems:
– Noise and outliers
– Missing values
– Duplicate data
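Two of the problems listed above, missing values and duplicate data, can be detected with very little code. The sketch below scans a small hypothetical data set; the records, the field names, and the use of `None` to mark a missing value are all assumptions made for illustration.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "city": "Hyderabad"},
    {"id": 2, "age": None, "city": "Pilani"},     # missing age
    {"id": 1, "age": 34, "city": "Hyderabad"},    # exact duplicate of row 0
]

# Missing values: (row index, attribute name) for every empty cell.
missing = [
    (i, key)
    for i, rec in enumerate(records)
    for key, value in rec.items()
    if value is None
]

# Duplicates: row indices whose full content was already seen earlier.
seen = set()
duplicates = []
for i, rec in enumerate(records):
    fingerprint = tuple(sorted(rec.items()))  # hashable form of the record
    if fingerprint in seen:
        duplicates.append(i)
    else:
        seen.add(fingerprint)

print(missing)      # [(1, 'age')]
print(duplicates)   # [2]
```

Detection is only the first step. Whether to drop, fill in, or merge the affected rows depends on the data and the task.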



Noise

[Figure: two panels, "Two Sine Waves" and "Two Sine Waves + Noise"]


Outliers
• Outliers are data objects whose characteristics are considerably different from those of most other data objects in the data set
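The definition does not prescribe a detection method. One common heuristic, which is not taken from the lecture, flags values lying more than 1.5 interquartile ranges outside the quartiles. The sketch below applies it to the Taxable Income column of the record table shown earlier (values in thousands).

```python
import statistics

# Taxable incomes (in thousands) from the record table shown earlier.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

# Quartiles via the standard library (default "exclusive" method).
q1, _, q3 = statistics.quantiles(incomes, n=4)
iqr = q3 - q1

# Assumption: the usual 1.5 * IQR fences; other multipliers are also used.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in incomes if x < lower_fence or x > upper_fence]
print(outliers)  # [220]
```

Under this rule only the 220K income is flagged. Whether it is a genuine outlier or a legitimate value is a judgment about the data, not something the rule decides.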



Missing Values



Duplicate Data



Data Preprocessing

• Aggregation

• Sampling

• Dimensionality Reduction

• Feature subset selection

• Feature creation

• Discretization and Binarization

• Attribute Transformation



Aggregation



Aggregation
Variation of Precipitation in Australia

[Figure: two panels, "Standard Deviation of Average Monthly Precipitation" and "Standard Deviation of Average Yearly Precipitation"]
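The figure contrasts variability at two levels of aggregation. A commonly cited effect, and plausibly the point of the figure, is that aggregated values tend to vary less than the raw ones. The sketch below illustrates this on made-up monthly precipitation values for three years; the numbers are hypothetical and are not the Australian data.

```python
import statistics

# Hypothetical monthly precipitation (mm) for three years.
base = [10, 20, 30, 40, 50, 60, 60, 50, 40, 30, 20, 10]
monthly = {
    2001: base,
    2002: [x + 2 for x in base],
    2003: [x - 1 for x in base],
}

# Aggregation: replace the 12 monthly values of each year by their average.
yearly_means = [statistics.mean(values) for values in monthly.values()]

# Variability before and after aggregation (sample standard deviation).
all_months = [x for values in monthly.values() for x in values]
monthly_sd = statistics.stdev(all_months)
yearly_sd = statistics.stdev(yearly_means)

print(yearly_means)
print(monthly_sd, yearly_sd)
```

For this data the yearly averages vary far less than the monthly values, because the seasonal swing inside each year is averaged away.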
Sampling



Sampling

• The key principle for effective sampling:
– A sample will work almost as well as the entire data set if the sample is representative (what counts as representative differs from one data set to another).

• Sampling may remove outliers and, if done improperly, can introduce noise.



Types of Sampling (done mostly with libraries)

• Simple random sampling
– There is an equal probability of selecting any particular item

• Sampling without replacement
– As each item is selected, it is removed from the population

• Sampling with replacement
– Objects are not removed from the population as they are selected for the sample
– The same object can therefore be picked more than once

• Stratified sampling
– Split the data into several partitions; then draw random samples from each partition
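The sampling schemes above map onto Python's standard `random` module roughly as follows. This is a sketch under stated assumptions: the population, the seed, the helper name `stratified_sample`, and the choice of two draws per stratum are illustrative, and the strata reuse the Tid values grouped by Marital Status from the record table.

```python
import random

rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
population = list(range(100))

# Sampling without replacement: every selected item is distinct.
without_replacement = rng.sample(population, k=10)

# Sampling with replacement: the same item may be picked more than once.
with_replacement = rng.choices(population, k=10)


def stratified_sample(strata, per_stratum, rng):
    """Draw up to per_stratum items at random from every partition."""
    return {
        name: rng.sample(members, k=min(per_stratum, len(members)))
        for name, members in strata.items()
    }


# Strata: Tid values from the record table, grouped by Marital Status.
strata = {
    "Single": [1, 3, 8, 10],
    "Married": [2, 4, 6, 9],
    "Divorced": [5, 7],
}
stratified = stratified_sample(strata, per_stratum=2, rng=rng)

print(without_replacement)
print(with_replacement)
print(stratified)
```

Drawing the same number from every stratum gives small groups such as Divorced more weight than they have in the data; drawing in proportion to stratum size is the usual alternative.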



Sample Size

[Figure: three panels, "8000 points", "2000 points", and "500 points"]



Take home message
• Features (also called attributes, measurements, or independent variables) are of four types: nominal, ordinal, interval, or ratio.

• The operations that are meaningful depend on the type of the data.

• A data set can be of the record, graph, or ordered type.

• Real-world data is dirty, so preprocessing is a very important step in data mining.

• There are several preprocessing methods; choosing the right one depends on the problem and the data obtained.
