IDS Unit 2 Additional Topics

Uploaded by

AI&DS VGNT

What Is Data Mining?

• Data mining (knowledge discovery from data)
  ◦ Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  ◦ Data mining: a misnomer?
• Alternative names
  ◦ Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Why Data Mining?

• The explosive growth of data: from terabytes to petabytes
• Data collection and data availability
  ◦ Automated data collection tools, database systems, Web, computerized society
• Major sources of abundant data
  ◦ Business: Web, e-commerce, transactions, stocks, …
  ◦ Science: remote sensing, bioinformatics, scientific simulation, …
  ◦ Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!

Examples of Data Mining Applications

1. Fraud detection: credit cards, phone cards
2. Marketing: customer targeting
3. Data warehousing: Walmart
4. Astronomy
5. Molecular biology

Origins of Data Mining

• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Must address:
  ◦ Enormity of data
  ◦ High dimensionality of data
  ◦ Heterogeneous, distributed nature of data

[Figure: Data Mining drawing on AI, Statistics, Machine Learning, and Database systems]

Knowledge Discovery (KDD) Process

[Figure: the KDD process, read from bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

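As a rough illustration of how the stages of the KDD process fit together, here is a minimal R sketch. The two tables, their column names, and the "average spend per age group" pattern are made-up assumptions for the example and are not taken from the slides.

```r
# Databases: two hypothetical source tables (made-up values)
customers <- data.frame(id = c(1, 2, 3, 4),
                        age = c(25, NA, 41, 33))
sales <- data.frame(id = c(1, 2, 3, 4, 4),
                    amount = c(120, 80, 200, 50, 70))

# Data integration: combine the sources on their shared key
warehouse <- merge(customers, sales, by = "id")

# Data cleaning: drop rows with missing values
warehouse <- na.omit(warehouse)

# Selection: keep only the task-relevant columns
task_data <- warehouse[, c("age", "amount")]

# Data mining: a very simple "pattern", the average spend per age group
task_data$group <- ifelse(task_data$age < 30, "under30", "30plus")
pattern <- aggregate(amount ~ group, data = task_data, FUN = mean)

# Pattern evaluation: inspect the result and judge whether it is useful
print(pattern)
```

Real mining steps would use far larger data and a proper algorithm; the point here is only the order of the stages.
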
Types of Data Sets

• Record
  ◦ Relational records
  ◦ Data matrix
  ◦ Document data: text documents

                team  coach  play  ball  score  game  win  lost  timeout  season
    Document 1     3      0     5     0      2     6    0     2        0       2
    Document 2     0      7     0     2      1     0    0     3        0       0
    Document 3     0      1     0     0      1     2    2     0        3       0

  ◦ Transaction data

    TID  Items
    1    Bread, Coke, Milk
    2    Beer, Bread
    3    Beer, Coke, Diaper, Milk
    4    Beer, Bread, Diaper, Milk
    5    Coke, Diaper, Milk

• Graph and network
  ◦ World Wide Web
  ◦ Social or information networks
• Ordered
  ◦ Video data: sequence of images
  ◦ Temporal data: time-series
• Spatial, image and multimedia
  ◦ Spatial data: maps
  ◦ Image and video data

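The document data and transaction data shown on this slide could be held in R roughly as follows. This is a sketch: the order of the term columns is an assumption (the slide's column labels did not survive cleanly), and the object names `dtm` and `transactions` are chosen here for illustration.

```r
# Document data as a document-term matrix:
# rows are documents, columns are term counts.
# Column order is assumed; the counts are the ones on the slide.
terms <- c("team", "coach", "play", "ball", "score",
           "game", "win", "lost", "timeout", "season")
dtm <- matrix(c(3, 0, 5, 0, 2, 6, 0, 2, 0, 2,
                0, 7, 0, 2, 1, 0, 0, 3, 0, 0,
                0, 1, 0, 0, 1, 2, 2, 0, 3, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste("Document", 1:3), terms))
print(dtm)

# Transaction data: each transaction ID (TID) maps to a set of items
transactions <- list(
  "1" = c("Bread", "Coke", "Milk"),
  "2" = c("Beer", "Bread"),
  "3" = c("Beer", "Coke", "Diaper", "Milk"),
  "4" = c("Beer", "Bread", "Diaper", "Milk"),
  "5" = c("Coke", "Diaper", "Milk")
)

# Count how many transactions contain each item
item_counts <- table(unlist(transactions))
print(item_counts)
```
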
Data Objects

• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
  ◦ Sales database: customers, store items, sales
  ◦ Medical database: patients, treatments
  ◦ University database: students, professors, courses
• Also called samples, examples, instances, data points, objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.

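The row/column correspondence can be seen directly in an R data frame. The table below is a hypothetical university example with invented names and values.

```r
# Hypothetical table: each row is one data object (a student),
# each column is one attribute.
students <- data.frame(
  student_id = c(101, 102, 103),
  name       = c("Asha", "Ben", "Chen"),
  course     = c("Statistics", "Databases", "Statistics")
)

nrow(students)    # number of data objects
names(students)   # attribute names
students[2, ]     # one data object (tuple)
students$course   # one attribute across all objects
```
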
Attributes

• Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
  ◦ E.g., customer_ID, name, address
• Types:
  ◦ Nominal
  ◦ Binary
  ◦ Ordinal
  ◦ Numeric (quantitative)
    ▪ Interval-scaled
    ▪ Ratio-scaled

Attribute Types

• Nominal: categories, states, or "names of things"
  ◦ Hair_color = {auburn, black, blond, brown, grey, red, white}
  ◦ Marital status, occupation, ID numbers, zip codes
• Binary
  ◦ Nominal attribute with only 2 states (0 and 1)
  ◦ Symmetric binary: both outcomes equally important
    ▪ e.g., gender
  ◦ Asymmetric binary: outcomes not equally important
    ▪ e.g., medical test (positive vs. negative)
    ▪ Convention: assign 1 to the most important outcome (e.g., HIV positive)
• Ordinal
  ◦ Values have a meaningful order (ranking), but the magnitude between successive values is not known.
  ◦ Size = {small, medium, large}, grades, army rankings

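In R, nominal, binary, and ordinal attributes map naturally onto unordered factors, logical vectors, and ordered factors. The following is a small sketch with invented values; the variable names are illustrative only.

```r
# Nominal: unordered categories
hair_color <- factor(c("black", "brown", "red", "black"))

# Binary: two states; TRUE (1) marks the more important outcome by convention
test_positive <- c(TRUE, FALSE, FALSE, TRUE)

# Ordinal: ordered categories; the order is meaningful, the distances are not
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)

levels(hair_color)       # the categories, with no ranking
is.ordered(hair_color)   # FALSE
size[1] < size[2]        # TRUE: small ranks below large
table(size)              # counts per level, in rank order
sum(test_positive)       # number of positive results
```

Comparisons such as `<` are defined for the ordered factor but not for the nominal one, which mirrors the distinction drawn on the slide.
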
Numeric Attribute Types

• Quantity (integer or real-valued)
• Interval
  ◦ Measured on a scale of equal-sized units
  ◦ Values have order
  ◦ E.g., temperature in °C or °F, calendar dates
• Ratio
  ◦ Inherent zero-point
  ◦ Ratio data has the same properties as interval data, plus an absolute zero, so the ratio between any two values is well defined.
  ◦ In other words, ratio data cannot take negative values.
  ◦ E.g., temperature in Kelvin, length, counts

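The difference between interval and ratio scales shows up when the same temperatures are expressed in Celsius and in Kelvin. The two readings below are arbitrary example values.

```r
celsius <- c(10, 20)
kelvin  <- celsius + 273.15   # Kelvin has an inherent zero point

# Interval scale: differences are meaningful and agree on both scales
diff(celsius)                 # 10
diff(kelvin)                  # 10

# Ratios are only meaningful on the ratio scale
celsius[2] / celsius[1]       # 2, yet 20 °C is not "twice as hot" as 10 °C
kelvin[2] / kelvin[1]         # about 1.035
```
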
Discrete vs. Continuous Attributes

• Discrete attribute
  ◦ Has only a finite or countably infinite set of values
  ◦ E.g., zip codes, profession, or the set of words in a collection of documents
  ◦ Sometimes represented as integer variables
  ◦ Note: binary attributes are a special case of discrete attributes
• Continuous attribute
  ◦ Has real numbers as attribute values
  ◦ E.g., temperature, height, or weight
  ◦ Practically, real values can only be measured and represented using a finite number of digits
  ◦ Continuous attributes are typically represented as floating-point variables

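A short R sketch of how discrete and continuous attributes are usually stored, and of what "a finite number of digits" means in practice. The variable names and values are invented for illustration.

```r
# Discrete attribute stored as integers
children <- c(0L, 2L, 1L, 3L)
typeof(children)                     # "integer"

# Continuous attribute stored as floating-point numbers
height_m <- c(1.62, 1.75, 1.80)
typeof(height_m)                     # "double"

# Finite precision: exact equality on real values can fail
0.1 + 0.2 == 0.3                     # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE: compare with a tolerance instead

# Binary attribute as a special case of a discrete attribute
smoker <- c(TRUE, FALSE, TRUE)
as.integer(smoker)                   # 1 0 1
```
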
Input statement in R
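The body of this slide did not survive extraction. As an assumption about what it covered, the sketch below shows two common base R ways of reading input, `readline()` and `scan()`. The prompt text and the fallback value "25" are made up so that the example also runs outside an interactive session.

```r
# readline() reads one line from the console as a character string.
# Guarded with interactive() so the script also runs non-interactively;
# "25" is a made-up fallback value.
age_text <- if (interactive()) readline(prompt = "Enter your age: ") else "25"
age <- as.integer(age_text)   # convert the text to a number

# scan() reads several values at once. Here it reads from a text string
# instead of the keyboard so that the example is reproducible.
marks <- scan(text = "70 85 90", quiet = TRUE)
print(marks)
```
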

Output statement in R
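The body of this slide did not survive extraction either. Assuming it covered the usual base R output functions, a minimal sketch with `print()`, `cat()`, `sprintf()`, and `paste()` might look like this; the value of `x` is just an example number.

```r
x <- 7.916667

print(x)                              # prints the value with its index, [1]
cat("The mean is", x, "\n")           # prints plain text, no index or quotes
cat(sprintf("Rounded: %.2f\n", x))    # formatted output with 2 decimals

msg <- paste("Mean:", round(x, 1))    # build a string, then print it
print(msg)
```

`print()` is what the examples in the table of statistical functions use; `cat()` is often preferred when the output should read like a sentence.
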

Statistical Functions in R

S. No  Function                      Description                            Example
1.     mean(x, trim=0, na.rm=FALSE)  Returns the mean of the object x.      a <- c(0:10, 40)
                                                                            xm <- mean(a)
                                                                            print(xm)
                                                                            Output: [1] 7.916667
2.     sd(x)                         Returns the standard deviation of x.   a <- c(0:10, 40)
                                                                            xm <- sd(a)
                                                                            print(xm)
                                                                            Output: [1] 10.58694
3.     median(x)                     Returns the median of x.               a <- c(0:10, 40)
                                                                            xm <- median(a)
                                                                            print(xm)
                                                                            Output: [1] 5.5
4.     quantile(x, probs)            Returns quantiles; x is the numeric
                                     vector whose quantiles are desired and
                                     probs is a numeric vector of
                                     probabilities in [0, 1].
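
The table gives no example for `quantile()`. A sketch in the same style as the other rows, using the same vector `a`, could be the following. The choice of the three quartile probabilities is an assumption made for illustration, and the results rely on R's default quantile algorithm (type 7).

```r
a <- c(0:10, 40)

# Quartiles of a: 25%, 50% (the median), and 75%
xq <- quantile(a, probs = c(0.25, 0.5, 0.75))
print(xq)
#  25%  50%  75%
# 2.75 5.50 8.25
```
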

August 7, 2024
Statistical Functions in R

S. No  Function                      Description                            Example
5.     range(x)                      Returns the minimum and maximum of x.  a <- c(0:10, 40)
                                                                            xm <- range(a)
                                                                            print(xm)
                                                                            Output: [1]  0 40
6.     sum(x)                        Returns the sum of x.                  a <- c(0:10, 40)
                                                                            xm <- sum(a)
                                                                            print(xm)
                                                                            Output: [1] 95
7.     diff(x, lag=1)                Returns differences between elements   a <- c(0:10, 40)
                                     that are lag positions apart.          xm <- diff(a)
                                                                            print(xm)
                                                                            Output: [1]  1  1  1  1  1  1  1  1  1  1 30
8.     min(x)                        Returns the minimum value of x.        a <- c(0:10, 40)
                                                                            xm <- min(a)
                                                                            print(xm)
                                                                            Output: [1] 0
9.     max(x)                        Returns the maximum value of x.        a <- c(0:10, 40)
                                                                            xm <- max(a)
                                                                            print(xm)
                                                                            Output: [1] 40
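
The examples in the table leave the optional arguments at their defaults. The sketch below, again on the vector `a`, shows what `trim` and `na.rm` do in `mean()` and what a different `lag` does in `diff()`. The specific argument values are chosen here for illustration.

```r
a <- c(0:10, 40)

# trim = 0.1 drops 10% of the values from each end before averaging,
# which removes the outlier 40 (and the 0) here
mean(a, trim = 0.1)           # 5.5

# A missing value makes the mean NA unless na.rm = TRUE
b <- c(a, NA)
mean(b)                       # NA
mean(b, na.rm = TRUE)         # 7.916667

# lag = 2 subtracts elements two positions apart
d2 <- diff(a, lag = 2)
print(d2)                     # nine 2s followed by 31
```

The trimmed mean of 5.5 matches the median of `a`, which shows how strongly the single value 40 pulls the ordinary mean upward.
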