R21 DM Unit1
Uploaded by Asif EE-010

PRINCIPLES OF DATA MINING

III Year – II Semester


Course Objectives
• To familiarize students with the concepts of data mining.
• To expose the design issues of supervised and
unsupervised learning algorithms.
UNIT - I: Data Mining
Introduction, Data Mining, Motivating
challenges, The origins of Data Mining,
Data Mining Tasks, Types of Data, Data Quality.
Introduction
• Rapid advances in data collection and storage technology have
enabled organizations to accumulate vast amounts of data.
• However, extracting useful information has proven extremely
challenging.
• Often, traditional data analysis tools and techniques cannot be
used because of the massive size of a data set.
Data Mining and Knowledge Discovery
• Data mining is an integral part of knowledge discovery in
databases (KDD), which is the overall process of converting
raw data into useful information.
• The input data can be stored in a variety of formats and may
reside in a centralized data repository or be distributed across
multiple sites.
• The purpose of preprocessing is to transform the raw input
data into an appropriate format for subsequent analysis.
• “Closing the loop” is the phrase often used to refer to the process
of integrating data mining results into decision support systems.
• Such integration requires a postprocessing step that ensures that
only valid and useful results are incorporated into the decision
support system.
Motivating Challenges
The following are some of the specific challenges that motivated
the development of data mining.
• Scalability: Because of advances in data generation and
collection, data sets with sizes of gigabytes, terabytes, or even
petabytes are becoming common.
• If data mining algorithms are to handle these massive data
sets, then they must be scalable.
• High Dimensionality: It is now common to encounter data sets with
hundreds or thousands of attributes.
• Data sets with temporal or spatial components also tend to have
high dimensionality.
• For example, consider a data set that contains measurements of
temperature at various locations; if the temperature is measured
repeatedly over time, the number of dimensions grows with the
number of measurements.
• Heterogeneous and Complex Data: Traditional data analysis
methods often deal with data sets containing attributes of the same
type, either continuous or categorical.
• As the role of data mining in business, science, medicine, and
other fields has grown, so has the need for techniques that can
handle heterogeneous attributes.
• Data Ownership and Distribution: Sometimes, the data needed
for an analysis is not stored in one location.
• Instead, the data is geographically distributed among resources
belonging to multiple entities.
• This requires the development of distributed data mining
techniques.
The Origins of Data Mining
• Researchers from different disciplines began to focus on
developing more efficient and scalable tools that could handle
diverse types of data.
• In particular, database systems are needed to provide support for
efficient storage, indexing, and query processing.
• Techniques from high performance (parallel) computing are often
important in addressing the massive size of some data sets.
• Distributed techniques can also help address the issue of size and
are essential when the data cannot be gathered in one location.
Data Mining Tasks
• Data mining tasks are generally divided into two major categories:
predictive tasks and descriptive tasks.
• Predictive tasks. The objective of these tasks is to predict the
value of a particular attribute based on the values of other attributes.
• The attribute to be predicted is commonly known as the target or
dependent variable, while the attributes used for making the
prediction are known as the explanatory or independent
variables.
• Descriptive tasks. Here, the objective is to derive patterns
(correlations, clusters, and anomalies) that summarize the
underlying relationships in data.
• Predictive modeling refers to the task of building a model for
the target variable as a function of the explanatory variables.
• There are two types of predictive modeling tasks:
classification, which is used for discrete target variables, and
regression, which is used for continuous target variables.
Example: Predicting the Type of a Flower.
Consider the task of predicting a species of flower based on the
characteristics of the flower.
In particular, consider classifying an Iris flower as belonging to
one of the following three Iris species: Setosa, Versicolour, or
Virginica.
• In addition to the species of a flower, this data set contains four
other attributes: sepal width, sepal length, petal length, and petal
width.
• Petal width low and petal length low implies Setosa. Petal width
medium and petal length medium implies Versicolour. Petal width
high and petal length high implies Virginica.
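The three rules above can be sketched as a simple rule-based classifier. This is a minimal illustration, not a fitted model; the numeric cut-offs below are illustrative assumptions standing in for "low", "medium", and "high".

```python
def classify_iris(petal_length, petal_width):
    """Toy rule-based classifier for the three Iris species.

    The thresholds are illustrative assumptions (in cm); a real
    model would learn them from the data.
    """
    if petal_width < 0.75 and petal_length < 2.5:      # low / low
        return "Setosa"
    elif petal_width < 1.75 and petal_length < 4.75:   # medium / medium
        return "Versicolour"
    else:                                              # high / high
        return "Virginica"

print(classify_iris(1.4, 0.2))  # prints "Setosa"
```

A predictive model of this kind maps the explanatory variables (petal length and width) to the target variable (species).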
• Association analysis is used to discover patterns that describe
strongly associated features in the data.
• The discovered patterns are typically represented in the form of
rules.
• Cluster analysis seeks to find groups of closely related
observations so that observations that belong to the same
cluster are more similar to each other than observations that
belong to other clusters.
• Example (Document Clustering). The collection of news
articles shown in the table can be grouped based on their
respective topics.
• Each article is represented as a set of word-frequency pairs
(w, c), where w is a word and c is the number of times the
word appears in the article.
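The (w, c) representation described above can be built with a simple frequency count. The article text below is invented for illustration.

```python
from collections import Counter

def word_frequency_pairs(article_text):
    """Represent an article as a set of (word, count) pairs."""
    words = article_text.lower().split()
    return set(Counter(words).items())

# Illustrative one-line "news article".
article = "dollar rises as dollar gains against yen"
pairs = word_frequency_pairs(article)
print(pairs)  # contains ('dollar', 2) among the other pairs
```

Articles with similar sets of (w, c) pairs would then be placed in the same cluster.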
• Anomaly detection is the task of identifying observations
whose characteristics are significantly different from the rest
of the data.
• Such observations are known as anomalies or outliers.
• Credit Card Fraud Detection. A credit card company records
the transactions made by every credit card holder, along with
personal information such as credit limit, age, annual income,
and address.
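One very simple way to flag unusual transactions like those described above is a z-score test: a value far from the mean, measured in standard deviations, is reported as an outlier. The amounts and the threshold below are illustrative assumptions, not a production fraud model.

```python
import statistics

def find_outliers(values, threshold=2.5):
    """Flag values whose z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Typical purchase amounts with one suspiciously large transaction.
amounts = [25, 30, 22, 28, 35, 27, 31, 24, 29, 26, 900]
print(find_outliers(amounts))  # prints [900]
```

Real fraud detection systems combine many such signals with the cardholder's personal profile, but the principle of "significantly different from the rest" is the same.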
Types of Data
• A data set can be viewed as a collection of data objects.
• Data objects are described by a number of attributes that
capture the basic characteristics of an object.
• Other names for an attribute are variable, characteristic, field,
feature, or dimension.
Example (Student Information).
• Often, a data set is a file in which the objects are records (or
rows) and each field (or column) corresponds to an attribute.
Attributes and Measurement
• An attribute is a property or characteristic of an object that may
vary, either from one object to another or from one time to
another.
• For example, eye color varies from person to person, while the
temperature of an object varies over time.
• A measurement scale is a rule (function) that associates a
numerical value with an attribute of an object.
The Type of an Attribute
• The values used to represent an attribute may have properties
that are not properties of the attribute itself, and vice versa.
• For example, while it is reasonable to talk about the average age
of employees, it makes no sense to talk about the average employee ID.
The Different Types of Attributes
• The following properties (operations) of numbers are typically
used to describe attributes: distinctness (= and ≠), order (<, ≤,
>, and ≥), addition (+ and −), and multiplication (× and ÷).
• Given these properties, we can define four types of attributes:
nominal, ordinal, interval, and ratio.
                Attribute Type   Description                       Examples               Operations
  Categorical   Nominal          The values of a nominal           zip codes,             mode
  (Qualitative)                  attribute are just different      employee ID
                                 names.                            numbers
                Ordinal          The values of an ordinal          {good, better, best},  median
                                 attribute provide enough          grades
                                 information to order objects.
  Numeric       Interval         The differences between           calendar dates         mean
  (Quantitative)                 values are meaningful.
                Ratio            Both differences and ratios       counts, age,           mean
                                 are meaningful.                   length
Transformations that define attribute levels
Describing Attributes by the Number of Values
• Discrete: A discrete attribute has a finite or countably infinite
set of values.
• Such attributes can be categorical, such as zip codes or ID
numbers.
• Binary attributes are discrete attributes and assume only two
values, e.g., true/false, yes/no, or 0/1.
• Continuous: A continuous attribute is one whose values are real
numbers.
• Examples include attributes such as temperature, height, or
weight.
Types of Data Sets
• There are many types of data sets; here we group them into three
categories: record data, graph-based data, and ordered data.
• General Characteristics of Data Sets are dimensionality, sparsity,
and resolution.
• Dimensionality: The dimensionality of a data set is the number
of attributes in the data set.
• The difficulties associated with analyzing high-dimensional data
are referred to as the curse of dimensionality.
• Sparsity: For some data sets, such as those with asymmetric
features, most attributes of an object have values of 0.
• Resolution: It is possible to obtain data at different levels of
resolution.
• For instance, the surface of the Earth seems very uneven at a
resolution of a few meters, but is relatively smooth at a resolution
of kilometers.
• Record Data: Much data mining work assumes that the data
set is a collection of records, each of which consists of a fixed
set of attributes.
Transaction or Market Basket Data
• Transaction data is a special type of record data, where each record
(transaction) involves a set of items.
• Consider a grocery store: the products purchased by a customer
during one shopping trip constitute a transaction.
Transaction data.
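Transaction data of this kind is naturally represented as a list of item sets. The sketch below uses illustrative grocery items and also computes the support of an itemset (the fraction of transactions containing all of its items), the basic quantity behind association analysis.

```python
# Each transaction is the set of items bought in one shopping trip
# (illustrative data).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"diapers", "beer"}, transactions))  # prints 0.6
```

An association rule such as {diapers} → {beer} would be considered interesting if this support, together with the rule's confidence, exceeds user-chosen thresholds.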
• The Data Matrix: If the data objects all have the same fixed set
of numeric attributes, the data set can be represented as a matrix
in which each row is an object and each column is an attribute.
• The Sparse Data Matrix is a special case of a data matrix in
which the attributes are of the same type and are asymmetric;
i.e., only non-zero values are important.
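Because only the non-zero values matter, a sparse data matrix is usually stored by recording just those entries. A minimal dictionary-of-dictionaries sketch, assuming rows are documents and columns are word indices (the data is illustrative):

```python
# Store only the non-zero entries: {row: {column: value}}.
sparse = {
    0: {3: 2, 7: 1},   # document 0: word 3 occurs twice, word 7 once
    1: {1: 5},         # document 1: word 1 occurs five times
}

def get(matrix, row, col):
    """Return the stored value, or 0 for an absent (zero) entry."""
    return matrix.get(row, {}).get(col, 0)

print(get(sparse, 0, 3))  # prints 2
print(get(sparse, 1, 3))  # prints 0 (entry not stored)
```

Production systems use dedicated sparse formats (e.g., compressed rows), but the idea is the same: absent entries are implicitly zero.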
Graph-Based Data
• A graph is a powerful representation for data.
• We consider two specific cases:
(1) the graph captures relationships among data objects, and
(2) the data objects themselves are represented as graphs.
Linked Web pages.
• Data with Objects that Are Graphs: If objects have structure, i.e.,
the objects contain sub-objects that have relationships, then such
objects are represented as graphs.
• Ordered Data: For some types of data, the attributes have
relationships that involve order in time or space.
• Sequential Data, also referred to as temporal data, can be an
extension of record data, where each record has a time associated
with it.
• Sequence Data consists of a data set that is a sequence of
individual entities, such as a sequence of words or letters.

Genomic sequence data


• Time Series Data is a special type of sequential data in which
each record is a time series, i.e., a series of measurements taken
over time.

Temperature time series


• Spatial Data: Some objects have spatial attributes, such as
positions or areas.
• An example of spatial data is weather data (temperature,
pressure) that is collected for a variety of geographical
locations.
Spatial temperature data
Data Quality
• Data quality is the measure of how well a data set serves its
specific purpose.
• The focus is on measurement and data collection issues.
Measurement and Data Collection Errors
• The term measurement error refers to any problem resulting from
the measurement process.
• A common problem is that the value recorded differs from the true
value to some extent.
• For continuous attributes, the numerical difference between the
measured value and the true value is called the error.
• The term data collection error refers to errors such as
omitting data objects or attribute values, or inappropriately
including a data object.
Noise and Artifacts
• Noise is the random component of a measurement error. It may
involve the distortion of a value or the addition of spurious
objects.
• Data errors may also be deterministic, such as a streak in the
same place on a set of photographs.
• Such deterministic distortions of the data are referred to as
artifacts.
• In statistics, the quality of the measurement process and the
resulting data are measured by precision and bias.
• Precision:: The closeness of repeated measurements (of the same
quantity) to one another.
• Bias:: A systematic variation of measurements from the quantity
being measured.
• Accuracy:: The closeness of measurements to the true value of the
quantity being measured.
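The three notions can be illustrated numerically. Suppose the true value of a quantity is 1.000 (the true value and the five repeated readings below are invented for illustration): precision is the spread of the repeated readings, and bias is their systematic offset from the true value.

```python
import statistics

true_value = 1.000                                  # illustrative
readings = [1.015, 1.012, 1.014, 1.016, 1.013]      # repeated measurements

bias = statistics.mean(readings) - true_value       # systematic offset
precision = statistics.stdev(readings)              # spread of the repeats

print(round(bias, 3))       # prints 0.014
print(round(precision, 4))  # small spread: precise but biased
```

Here the measurements are precise (they agree closely with one another) yet inaccurate, because the bias of about +0.014 shifts them all away from the true value.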
Outliers
• Outliers are either (1) data objects that have characteristics
different from the other data objects in the data set, or
(2) values of an attribute that are unusual with respect to the
typical values for that attribute.
Missing Values
• It is not unusual for an object to be missing one or more attribute values.
• In some cases, the information was not collected; e.g., some people
decline to give their age or weight.
There are several strategies for dealing with missing data
• Eliminate Data Objects or Attributes: A simple and effective
strategy is to eliminate objects with missing values.
• Estimate Missing Values: Sometimes missing data can be reliably
estimated.
• For example, consider a time series that changes in a reasonably
smooth fashion, but has a few, widely scattered missing values.
• In such cases, the missing values can be estimated by using the
remaining values.
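For a smoothly varying time series, an isolated missing value can be estimated by interpolating between its neighbours. A minimal sketch, using None to mark missing readings (the temperatures are illustrative):

```python
def fill_missing(series):
    """Replace an interior None with the average of its two neighbours.

    Assumes missing values are isolated (never two in a row) and do
    not occur at either end of the series.
    """
    filled = list(series)
    for i, v in enumerate(filled):
        if v is None:
            filled[i] = (filled[i - 1] + filled[i + 1]) / 2
    return filled

temps = [20.0, 21.0, None, 23.0, 24.0]
print(fill_missing(temps))  # prints [20.0, 21.0, 22.0, 23.0, 24.0]
```

More sophisticated estimates use a larger neighbourhood or, for non-temporal data, the values of the most similar objects.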
• Ignore the Missing Value during Analysis: Many data mining
approaches can be modified to ignore missing values.
• For example, suppose that objects are being clustered and the
similarity between pairs of data objects needs to be calculated.
Inconsistent Values
• Data can contain inconsistent values. Consider an address field,
where both a zip code and city are listed, but the specified zip code
area is not contained in that city.
Duplicate Data
• A data set may include data objects that are duplicates, or almost
duplicates, of one another.
• Many people receive duplicate mailings because they appear in a
database multiple times under slightly different names.
Issues Related to Applications
• Data quality issues can also be considered from an application
viewpoint, as expressed by the statement “data is of high quality
if it is suitable for its intended use.”
• Timeliness: Some data starts to age as soon as it has been collected.
• Relevance: The available data must contain the information
necessary for the application.
• Consider the task of building a model that predicts the accident
rate for drivers.
• Knowledge about the Data: Ideally, data sets should contain
documentation that describes different aspects of the data.


• Thank you
