IDS Unit 2 Additional Topics
Why Data Mining?
Examples of Data Mining Applications

Must address:
  Enormity of data
  High dimensionality
  Distributed nature of data
Related fields (from the figure): AI / Machine Learning, Statistics, Database systems
Knowledge Discovery (KDD) Process
Stages shown in the figure, in processing order:
  Databases -> Data Cleaning and Data Integration -> Task-relevant Data -> Data Mining -> Pattern Evaluation
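The stages above can be sketched on a toy example in base R. This is only an illustration under stated assumptions: the data frames `sales` and `customers`, their values, and the threshold of 25 are made up, and a simple group mean stands in for a real mining algorithm.

```r
# Hypothetical toy sources (made-up values, for illustration only)
sales <- data.frame(id = c(1, 2, 3, 4), amount = c(10, NA, 30, 40))
customers <- data.frame(id = c(1, 2, 3, 4), region = c("N", "S", "N", "S"))

# Data cleaning: drop rows with missing values
cleaned <- na.omit(sales)

# Data integration: combine the two sources on their shared key
integrated <- merge(cleaned, customers, by = "id")

# Task-relevant data: keep only the columns needed for the analysis
relevant <- integrated[, c("region", "amount")]

# Data mining (stand-in): average amount per region
pattern <- aggregate(amount ~ region, data = relevant, FUN = mean)

# Pattern evaluation: keep only results above a hypothetical threshold
interesting <- pattern[pattern$amount > 25, ]
print(interesting)
```

Each step produces the input of the next one, which mirrors the left-to-right flow of the stages.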
Types of Data Sets
Record
  Relational records
  Data matrix
  Document data: text documents
  Transaction data
Ordered
  Video data: sequence of images
  Temporal data: time-series
Spatial, image and multimedia
  Spatial data: maps
  Image and video data

Document data example (term counts per document):

              team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1     3      0     5     0      2     6    0     2        0       2

Transaction data example:

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk
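As a rough sketch, the transaction data above could be held in R as a named list, one character vector of items per TID. The variable names (`transactions`, `item_counts`, `n_both`) are illustrative choices, not part of the original material.

```r
# Transaction data: one vector of items per transaction ID (TID)
transactions <- list(
  "1" = c("Bread", "Coke", "Milk"),
  "2" = c("Beer", "Bread"),
  "3" = c("Beer", "Coke", "Diaper", "Milk"),
  "4" = c("Beer", "Bread", "Diaper", "Milk"),
  "5" = c("Coke", "Diaper", "Milk")
)

# How many transactions contain each item
item_counts <- table(unlist(transactions))
print(item_counts)

# How many transactions contain both Diaper and Milk
has_both <- sapply(transactions, function(items) all(c("Diaper", "Milk") %in% items))
n_both <- sum(has_both)
print(n_both)
```

A list fits here because transactions have different lengths, unlike the rows of a data matrix.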
Data Objects
Attributes
Attribute (also called dimension, feature, or variable): a data field representing a characteristic or feature of a data object.
  E.g., customer_ID, name, address
Types:
  Nominal
  Binary
  Ordinal
  Numeric (quantitative):
    Interval-scaled
    Ratio-scaled
Attribute Types
Nominal: categories, states, or “names of things”
  E.g., Hair_color = {auburn, black, blond, brown, grey, red, white}
  Other examples: marital status, occupation, ID numbers, zip codes
Binary
  Nominal attribute with only 2 states (0 and 1)
  Symmetric binary: both outcomes are equally important
    e.g., gender
  Asymmetric binary: outcomes are not equally important
    e.g., medical test (positive vs. negative)
    Convention: assign 1 to the most important outcome (e.g., HIV positive)
Ordinal
  Values have a meaningful order (ranking), but the magnitude between successive values is not known.
  E.g., Size = {small, medium, large}, grades, army rankings
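One possible way to represent these attribute types in R is sketched below, assuming factors for nominal values, logical vectors for binary values, and ordered factors for ordinal values. The sample values and variable names are made up for illustration.

```r
# Nominal: an unordered factor; levels are just names
hair_color <- factor(c("black", "brown", "red", "black"))

# Binary: TRUE marks the outcome of interest (e.g., a positive test)
test_positive <- c(TRUE, FALSE, FALSE)

# Ordinal: an ordered factor; the level order is stated explicitly
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)

# Order comparisons are defined for ordinal values ...
small_before_large <- size[1] < size[2]
print(small_before_large)

# ... but the distance between levels is not, so arithmetic is not meaningful
print(levels(size))
```

Declaring the level order explicitly matters: by default R would sort the levels alphabetically (large, medium, small).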
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
  Measured on a scale of equal-sized units
  Values have order
  No true zero-point, so ratios between values are not meaningful
  E.g., temperature in °C or °F, calendar dates
Ratio
  Inherent (absolute) zero-point
  Ratio data has all the properties of interval data, plus an absolute zero, so the ratio between two values is meaningful.
  In other words, ratio data typically cannot take negative values.
  E.g., temperature in Kelvin, length, counts
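The difference between the two scales can be checked numerically. The sketch below uses two made-up temperatures and the standard conversion K = °C + 273.15. Differences agree on both scales, but only the Kelvin ratio has a physical meaning.

```r
# Two hypothetical temperature readings in degrees Celsius (interval scale)
celsius <- c(10, 20)

# The same readings in Kelvin (ratio scale)
kelvin <- celsius + 273.15

# Differences are meaningful on both scales and agree (up to rounding)
diff_c <- celsius[2] - celsius[1]
diff_k <- kelvin[2] - kelvin[1]

# Ratios: 20 / 10 = 2 on the Celsius scale, but 20 °C is not "twice as hot"
ratio_c <- celsius[2] / celsius[1]

# The Kelvin ratio (about 1.035) is the physically meaningful one
ratio_k <- kelvin[2] / kelvin[1]

print(c(diff_c = diff_c, diff_k = diff_k, ratio_c = ratio_c, ratio_k = ratio_k))
```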
Discrete vs. Continuous Attributes
Discrete Attribute
  Has only a finite or countably infinite set of values
  E.g., the set of words in a collection of documents
  Sometimes represented as integer variables
Continuous Attribute
  Has real numbers as attribute values
  Typically represented as floating-point variables
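In R, this distinction roughly corresponds to integer versus double storage. The following is a small sketch with made-up values; the variable names are illustrative only.

```r
# Discrete: counts stored as integers (the L suffix creates an integer)
word_count <- c(3L, 0L, 5L)

# Continuous: measurements stored as double-precision floating point
height_m <- c(1.72, 1.805, 1.6)

print(is.integer(word_count))   # TRUE
print(is.double(height_m))      # TRUE

# Continuous values are stored with finite precision, so exact equality
# can fail; all.equal() compares within a small tolerance instead
exact_equal <- (0.1 + 0.2) == 0.3
near_equal <- isTRUE(all.equal(0.1 + 0.2, 0.3))
print(c(exact_equal, near_equal))   # FALSE TRUE
```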
Input statement in R
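A minimal sketch of reading input with base R, assuming keyboard input through `readline()` and `scan()`. `readline()` only waits for the user in an interactive session, so it is guarded with `interactive()`, and `scan(text = ...)` stands in for typed input so the snippet also runs as a script. The names and sample values are hypothetical.

```r
# readline() returns the typed line as a character string.
# It only waits for input in an interactive session, hence the guard.
if (interactive()) {
  name <- readline(prompt = "Enter your name: ")
} else {
  name <- "student"  # hypothetical stand-in when run as a script
}

# Input always arrives as text; convert it before doing arithmetic
age_text <- "21"               # what readline() would return for the input 21
age <- as.numeric(age_text)

# scan() reads several values at once; text = replaces the keyboard here
marks <- scan(text = "70 85 90", what = numeric(), quiet = TRUE)
total <- sum(marks)
print(total)
```

Calling `scan()` with no `text` or `file` argument reads from the console instead, ending at an empty line.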
Output statement in R
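A minimal sketch of the common base R output calls, assuming `print()`, `cat()`, `paste()`, and `sprintf()` are the functions of interest. The vector `x` and the message text are made up.

```r
x <- c(1.5, 2.25, 3)

# print() shows an object with its index prefix, e.g. [1]
print(x)

# cat() writes its arguments separated by spaces, without the [1] prefix;
# the newline has to be added explicitly
cat("Mean of x:", mean(x), "\n")

# paste() joins values into one string; sprintf() also controls the format
label <- paste("n =", length(x))
msg <- sprintf("Mean = %.2f", mean(x))
print(label)
print(msg)
```

`print()` suits inspecting objects, while `cat()` and `sprintf()` suit building readable messages.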
Statistical Functions in R
1. mean(x, trim = 0, na.rm = FALSE)
   Description: returns the mean of object x.
   Example:
       a <- c(0:10, 40)
       xm <- mean(a)
       print(xm)
   Output:
       [1] 7.916667

2. sd(x)
   Description: returns the standard deviation of an object.
   Example:
       a <- c(0:10, 40)
       xm <- sd(a)
       print(xm)
   Output:
       [1] 10.58694

3. median(x)
   Description: returns the median.
   Example:
       a <- c(0:10, 40)
       xm <- median(a)
       print(xm)
   Output:
       [1] 5.5

4. quantile(x, probs)
   Description: returns quantiles, where x is the numeric vector whose quantiles are desired and probs is a numeric vector of probabilities in [0, 1].
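The quantile entry has no example, so here is a short sketch using the same vector as the other rows. It relies on R's default quantile method (`type = 7`); other `type` values can give slightly different results.

```r
a <- c(0:10, 40)

# Quartiles of a, using the default quantile method (type = 7)
q <- quantile(a, probs = c(0.25, 0.5, 0.75))
print(q)
# 25% = 2.75, 50% = 5.5, 75% = 8.25

# The 50% quantile equals the median
print(median(a))
```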
August 7, 2024
Statistical Functions in R (continued)

5. range(x)
   Description: returns the range (minimum and maximum).
   Example:
       a <- c(0:10, 40)
       xm <- range(a)
       print(xm)
   Output:
       [1]  0 40

6. sum(x)
   Description: returns the sum.
   Example:
       a <- c(0:10, 40)
       xm <- sum(a)
       print(xm)
   Output:
       [1] 95

7. diff(x, lag = 1)
   Description: returns differences between elements, with lag indicating which lag to use.
   Example:
       a <- c(0:10, 40)
       xm <- diff(a)
       print(xm)
   Output:
       [1]  1  1  1  1  1  1  1  1  1  1 30

8. min(x)
   Description: returns the minimum value.
   Example:
       a <- c(0:10, 40)
       xm <- min(a)
       print(xm)
   Output:
       [1] 0

9. max(x)
   Description: returns the maximum value.
   Example:
       a <- c(0:10, 40)
       xm <- max(a)
       print(xm)
   Output:
       [1] 40
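The optional arguments listed in the signatures (`trim`, `na.rm`, `lag`) are not exercised by the examples. The sketch below shows their effect on the same vector `a`; the vector `b` with a missing value is made up for illustration.

```r
a <- c(0:10, 40)

# trim = 0.1 drops 10% of the observations from each end before averaging,
# which here removes the smallest value (0) and the outlier (40)
trimmed <- mean(a, trim = 0.1)
print(trimmed)            # 5.5

# na.rm = TRUE skips missing values; without it the result is NA
b <- c(1, NA, 3)
print(mean(b))            # NA
mean_no_na <- mean(b, na.rm = TRUE)
print(mean_no_na)         # 2

# lag = 2 takes differences between elements two positions apart
d2 <- diff(a, lag = 2)
print(d2)                 # nine 2s followed by 31
```

The trimmed mean (5.5) matches the median of `a`, which shows how strongly the single value 40 pulls the untrimmed mean (7.916667) upward.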