02data Part1

This chapter discusses getting to know your data by understanding its characteristics. It covers data types like nominal, binary, ordinal, interval-scaled and ratio-scaled attributes. Discrete attributes have a finite set of values while continuous attributes have real number values. Understanding attribute types, distributions, and visualizing the data helps in preprocessing tasks like handling missing values and outliers.

Uploaded by

baigsalman251

Data Mining

Dr. Shahid Mahmood Awan

http://turing.cs.pub.ro/mas_11
curs.cs.pub.ro
University of Management and Technology

Spring 2016
Data Mining:
Concepts and Techniques

— Chapter 2 —

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign
Simon Fraser University
©2011 Han, Kamber, and Pei. All rights reserved.
2
Data Types, Data Statistics, Data Visualization

GETTING TO KNOW YOUR DATA
4
Data
• This chapter is about getting familiar with your data.
• Knowledge about your data is useful for data preprocessing.
• What are the types of attributes or fields that make up your data?
• What kind of values does each attribute have?
• Which attributes are discrete, and which are continuous-valued?
• What do the data look like?
• How are the values distributed?
• Are there ways we can visualize the data to get a better sense of it all?
• Can we spot any outliers?
• Can we measure the similarity of some data objects with respect to others?

5
• Knowing such basic statistics regarding each attribute makes it easier to:
  • fill in missing values,
  • smooth noisy values,
  • spot outliers during data preprocessing.
• Knowledge of the attributes and attribute values can also help in fixing inconsistencies incurred during data integration.
• Plotting the measures of central tendency shows us if the data are symmetric or skewed.
• Quantile plots, histograms, and scatter plots are other graphic displays of basic statistical descriptions.
• These can help identify relations, trends, and biases “hidden” in unstructured data sets.
6
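The point about central tendency revealing symmetry or skew can be sketched in a few lines. This is a minimal illustration, not a method from the slides: the function name `skew_direction`, the `tol` parameter, and the salary values are assumptions chosen for the example, and comparing mean with median is only a rough heuristic.

```python
from statistics import mean, median

def skew_direction(values, tol=0.0):
    """Rough heuristic: compare mean and median to guess the direction of skew.

    `tol` (an assumed parameter) is how far apart the two may be
    while still being reported as roughly symmetric.
    """
    m, med = mean(values), median(values)
    if abs(m - med) <= tol:
        return "roughly symmetric (mean ~ median)"
    if m > med:
        return "positively skewed (mean > median)"
    return "negatively skewed (mean < median)"

# Illustrative salaries in thousands; the single large value pulls the mean up.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", mean(salaries), "median:", median(salaries))
print(skew_direction(salaries))
```

A histogram or quantile plot would show the same thing visually; the comparison above is only the quickest numeric check.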
Data First, Data Mining Later

7
Chapter 2: Getting to Know Your Data
• Data Objects and Attribute Types
• Basic Statistical Descriptions of Data
• Data Visualization
• Measuring Data Similarity and Dissimilarity
• Summary

8
Types of Data Sets
• Record
  • Relational records
  • Data matrix, e.g., numerical matrix, crosstabs
  • Document data: text documents: term-frequency vector
  • Transaction data
• Graph and network
  • World Wide Web
  • Social or information networks
  • Molecular Structures
• Ordered
  • Video data: sequence of images
  • Temporal data: time-series
  • Sequential Data: transaction sequences
  • Genetic sequence data
• Spatial, image and multimedia:
  • Spatial data: maps
  • Image data
  • Video data

Document data (term-frequency vectors):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
9
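Two of the record types on this slide are easy to mimic in code. The sketch below builds a term-frequency vector for a short document and stores the slide's five transactions as item sets. The vocabulary order, the helper name `term_frequency_vector`, and the sample sentence are assumptions made for the example; only the transaction contents come from the slide.

```python
from collections import Counter

# Assumed column order for the term-frequency vector (terms from the slide).
VOCAB = ["team", "coach", "play", "ball", "score",
         "game", "win", "lost", "timeout", "season"]

def term_frequency_vector(text, vocab=VOCAB):
    """Count how often each vocabulary term occurs in a whitespace-split text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

# Hypothetical document, not taken from the slides.
doc = "team play game play score team game lost"
print(term_frequency_vector(doc))

# Transaction data: each transaction ID maps to an unordered set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Transactions that contain both Diaper and Milk (subset test on sets).
with_both = [tid for tid, items in transactions.items()
             if {"Diaper", "Milk"} <= items]
print(with_both)
```

Sets fit transaction data because item order and repetition carry no meaning there, whereas the term-frequency vector needs a fixed column order to be comparable across documents.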
Important Characteristics of Structured Data
• Dimensionality
  • Curse of dimensionality
• Sparsity
  • Only presence counts
• Resolution
  • Patterns depend on the scale
• Distribution
  • Centrality and dispersion

10
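Dimensionality and sparsity can be measured directly on a small data matrix. The sketch below uses the three document rows from the earlier term-frequency example; the variable names and the dictionary-based sparse form are illustrative choices, not something the slides prescribe.

```python
# Rows are the three documents from the term-frequency example.
matrix = [
    [3, 0, 5, 0, 2, 6, 0, 2, 0, 2],
    [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
]

# Dimensionality: number of attributes (columns) per data object.
dimensionality = len(matrix[0])

# Sparsity: fraction of entries that are zero.
zeros = sum(value == 0 for row in matrix for value in row)
sparsity = zeros / (len(matrix) * dimensionality)

# "Only presence counts": keep just the non-zero entries, keyed by column index.
sparse_rows = [{j: v for j, v in enumerate(row) if v != 0} for row in matrix]

print(dimensionality, sparsity)
print(sparse_rows[1])
```

With real document collections the vocabulary runs to many thousands of terms and almost every entry is zero, which is why the sparse form is usually the practical one.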
Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
  • sales database: customers, store items, sales
  • medical database: patients, treatments
  • university database: students, professors, courses
• Also called samples, examples, instances, data points, objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.

11
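The rows-as-objects, columns-as-attributes view maps naturally onto a small record type. The class name `Customer`, its three fields, and the sample values below are hypothetical, chosen only to mirror the sales-database example.

```python
from dataclasses import dataclass, fields

@dataclass
class Customer:
    """One data object (row); each field is an attribute (column)."""
    customer_id: int
    name: str
    address: str

# A tiny data set: a collection of data objects. Values are made up.
customers = [
    Customer(1, "Alice", "12 Oak St"),
    Customer(2, "Bob", "34 Elm Ave"),
]

# The attribute names play the role of the column headers.
attribute_names = [f.name for f in fields(Customer)]
print(attribute_names)

# Reading one attribute across all objects gives that column's values.
names = [c.name for c in customers]
print(names)
```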
Attributes
• Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
  • E.g., customer_ID, name, address
• Observed values for a given attribute are called observations.
• A distribution involving one attribute is univariate; one involving two attributes is bivariate.
• Types:
  • Nominal
  • Binary
  • Ordinal
  • Numeric: quantitative
    • Interval-scaled
    • Ratio-scaled

12
Attribute Types
• Nominal: relating to names
  • Categories, states, or “names of things”
  • Hair_color = {auburn, black, blond, brown, grey, red, white}
  • Marital_status = {single, married, divorced, widowed}
  • Occupation = {teacher, dentist, programmer, farmer}
  • ID numbers and zip codes are also nominal: the digits are labels, not quantities

13
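Because nominal values are only names, the operations that make sense are equality tests and frequency counts. The sketch below finds the most common value (the mode) of a hypothetical hair-colour column and keeps zip codes as strings so that nobody averages them by accident; all sample values are made up.

```python
from collections import Counter

# Hypothetical observations of a nominal attribute.
hair_colors = ["black", "brown", "black", "blond", "red", "black", "brown"]

# Frequency counts and the mode are meaningful for nominal data.
counts = Counter(hair_colors)
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)

# Equality comparison is meaningful; ordering or arithmetic is not.
same_color = hair_colors[0] == hair_colors[2]
print(same_color)

# Zip codes look numeric but are labels, so store them as strings.
zip_codes = ["10001", "94105", "10001"]
print(Counter(zip_codes).most_common(1)[0][0])
```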
Attribute Types
• Binary
  • Nominal attribute with only 2 states (0 and 1)
  • Symmetric binary: both outcomes equally important
    • e.g., gender
  • Asymmetric binary: outcomes not equally important
    • e.g., medical test (positive vs. negative)
    • Convention: assign 1 to the most important outcome, usually the rarest one (e.g., HIV positive)

14
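The coding convention for asymmetric binary attributes can be sketched as follows. The helper `encode_asymmetric` is hypothetical, and it simply assumes that the rarer outcome is the important one, which matches the convention on the slide but is not guaranteed for every data set.

```python
from collections import Counter

def encode_asymmetric(values):
    """Map the rarer of two outcomes to 1 and the other to 0.

    Assumes exactly two distinct outcomes. If both are equally frequent,
    the outcome seen first is coded as 1.
    """
    counts = Counter(values)
    if len(counts) != 2:
        raise ValueError("expected exactly two distinct outcomes")
    rare = min(counts, key=counts.get)
    return [1 if v == rare else 0 for v in values], rare

# Hypothetical medical test results.
test_results = ["negative", "negative", "negative", "positive", "negative",
                "negative", "positive", "negative", "negative", "negative"]

encoded, rare_outcome = encode_asymmetric(test_results)
print(rare_outcome)
print(encoded)
```

For a symmetric binary attribute the choice of which state is 0 and which is 1 is arbitrary, so no such rule is needed.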
Attribute Types
• Ordinal
  • Values have a meaningful order (ranking), but the magnitude between successive values is not known.
  • Drink size = {small, medium, large}
  • Shirt size = {small, medium, large}
  • Class grades
  • Army rankings
• The values of such qualitative attributes are typically words representing categories.

15
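Since ordinal values have an order but no known spacing, sorting and the median make sense while the mean does not. The sketch below maps sizes to ranks for that purpose; the names `SIZE_ORDER` and `RANK` and the sample orders are assumptions for illustration.

```python
from statistics import median_low

# The order is part of the attribute's definition and must be supplied.
SIZE_ORDER = ["small", "medium", "large"]
RANK = {label: position for position, label in enumerate(SIZE_ORDER)}

# Hypothetical drink orders.
orders = ["small", "large", "medium", "medium", "large", "small", "medium"]

# Sorting by rank is meaningful for ordinal data.
sorted_orders = sorted(orders, key=RANK.__getitem__)
print(sorted_orders)

# median_low always returns an actual data value, so it maps back to a label.
median_size = SIZE_ORDER[median_low(RANK[o] for o in orders)]
print(median_size)
```

The ranks 0, 1, 2 are only positions: averaging them would silently assume that "medium" lies exactly halfway between "small" and "large", which the attribute does not say.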
Numeric Attribute Types
• Measurable quantity (integer or real-valued)
• Numeric attributes can be interval-scaled or ratio-scaled.
• Interval-scaled
  • Measured on a scale of equal-sized units
  • Values have order
  • Can be positive, 0, or negative
  • E.g., temperature in °C or °F, calendar dates
  • A temperature of 20 °C is five degrees higher than a temperature of 15 °C
  • No true zero-point, so values cannot be compared as ratios
16
Numeric Attribute Types
• Ratio-scaled
  • Inherent zero-point: values start from 0
  • We can speak of a value as being a multiple (or ratio) of another value (10 K is twice as high as 5 K).
  • E.g., temperature in Kelvin, length, counts, monetary quantities
  • You are 100 times richer with $100 than with $1
17
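The difference between interval-scaled and ratio-scaled attributes shows up as soon as the unit changes. In this sketch the helper names `c_to_f` and `c_to_k` and the chosen temperatures and lengths are illustrative; the conversion formulas themselves are the standard ones.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

def c_to_k(celsius):
    """Convert degrees Celsius to kelvin."""
    return celsius + 273.15

warm_c, cool_c = 20.0, 10.0

# Interval scale: the ratio changes with the unit, so it carries no meaning.
ratio_c = warm_c / cool_c                   # 2.0 in Celsius
ratio_f = c_to_f(warm_c) / c_to_f(cool_c)   # 68 / 50 in Fahrenheit
print(ratio_c, ratio_f)

# Ratio scale: Kelvin has a true zero, so this ratio is physically meaningful.
ratio_k = c_to_k(warm_c) / c_to_k(cool_c)
print(round(ratio_k, 4))

# Length is ratio-scaled: the ratio is the same in metres and centimetres.
ratio_m = 3.0 / 1.5
ratio_cm = 300.0 / 150.0
print(ratio_m == ratio_cm)
```

So 20 °C is not "twice as warm" as 10 °C: the apparent factor of 2 is an artefact of where the Celsius scale puts its zero.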
Discrete vs. Continuous Attributes
• Discrete Attribute
  • Has only a finite or countably infinite set of values
  • E.g., zip codes, profession, or the set of words in a collection of documents
  • Sometimes represented as integer variables
  • Note: Binary attributes are a special case of discrete attributes

18
Discrete vs. Continuous Attributes
• Continuous Attribute
  • Has real numbers as attribute values
  • E.g., temperature, height, or weight
  • In practice, real values can only be measured and represented using a finite number of digits
  • Continuous attributes are typically represented as floating-point variables

19
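The remark that real values are stored with a finite number of digits has a practical consequence: continuous attributes should be compared with a tolerance, while discrete ones can be compared exactly. The sketch below uses Python's standard `math.isclose`; the variable names and sample values are illustrative.

```python
import math

# Continuous attribute stored as floating point: 0.1 and 0.2 have no exact
# binary representation, so their sum is not exactly 0.3.
total = 0.1 + 0.2
exactly_equal = total == 0.3
approximately_equal = math.isclose(total, 0.3)
print(exactly_equal, approximately_equal)

# Discrete attribute stored as integers: exact comparison is safe.
word_count_a = 1 + 2
word_count_b = 3
print(word_count_a == word_count_b)
```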
