R21 DM Unit1
Uploaded by Asif EE-010

PRINCIPLES OF DATA MINING

III Year – II Semester


Course Objectives
• To familiarize students with the concepts of data mining.
• To expose the design issues of supervised and
unsupervised learning algorithms.
UNIT - I: Data Mining
Introduction, Data Mining, Motivating
challenges, The origins of Data Mining,
Data Mining Tasks, Types of Data, Data Quality.
Introduction
• Rapid advances in data collection and storage technology have
enabled organizations to accumulate vast amounts of data.
• However, extracting useful information has proven extremely
challenging.
• Often, traditional data analysis tools and techniques cannot be
used because of the massive size of a data set.
Data Mining and Knowledge Discovery
• Data mining is an integral part of knowledge discovery in
databases (KDD), which is the overall process of converting
raw data into useful information.
• The input data can be stored in a variety of formats and may
reside in a centralized data repository or be distributed across
multiple sites.
• The purpose of preprocessing is to transform the raw input
data into an appropriate format for subsequent analysis.
• “Closing the loop” is the phrase often used to refer to the process
of integrating data mining results into decision support systems.
• Such integration requires a postprocessing step that ensures that
only valid and useful results are incorporated into the decision
support system.
Motivating Challenges
The following are some of the specific challenges that motivated
the development of data mining.
• Scalability: Because of advances in data generation and
collection, data sets with sizes of gigabytes, terabytes, or even
petabytes are becoming common.
• If data mining algorithms are to handle these massive data
sets, then they must be scalable.
• High Dimensionality: It is now common to encounter data sets with
hundreds or thousands of attributes.
• Data sets with temporal or spatial components also tend to have
high dimensionality.
• For example, consider a data set that contains measurements of
temperature at various locations; if the temperature is measured
repeatedly over time, the number of dimensions grows with the
number of measurements.
• Heterogeneous and Complex Data: Traditional data analysis
methods often deal with data sets containing attributes of the same
type, either continuous or categorical.
• As the role of data mining in business, science, medicine, and
other fields has grown, so has the need for techniques that can
handle heterogeneous attributes.
• Data Ownership and Distribution: Sometimes, the data needed
for an analysis is not stored in one location.
• Instead, the data is geographically distributed among resources
belonging to multiple entities.
• This requires the development of distributed data mining
techniques.
The Origins of Data Mining
• Researchers from different disciplines began to focus on
developing more efficient and scalable tools that could handle
diverse types of data.
• In particular, database systems are needed to provide support for
efficient storage, indexing, and query processing.
• Techniques from high performance (parallel) computing are often
important in addressing the massive size of some data sets.
• Distributed techniques can also help address the issue of size and
are essential when the data cannot be gathered in one location.
Data Mining Tasks
• Data mining tasks are generally divided into two major categories:
predictive tasks and descriptive tasks.
• Predictive tasks. The objective of these tasks is to predict the
value of a particular attribute based on the values of other attributes.
• The attribute to be predicted is commonly known as the target or
dependent variable, while the attributes used for making the
prediction are known as the explanatory or independent
variables.
• Descriptive tasks. Here, the objective is to derive patterns
(correlations, clusters, and anomalies) that summarize the
underlying relationships in data.
• Predictive modeling refers to the task of building a model for
the target variable as a function of the explanatory variables.
• There are two types of predictive modeling tasks:
classification, which is used for discrete target variables, and
regression, which is used for continuous target variables.
Example: Predicting the Type of a Flower.
Consider the task of predicting a species of flower based on the
characteristics of the flower.
In particular, consider classifying an Iris flower as belonging to
one of the following three Iris species: Setosa, Versicolour, or
Virginica.
• In addition to the species of a flower, this data set contains four
other attributes: sepal width, sepal length, petal length, and petal
width.
• Petal width low and petal length low implies Setosa. Petal width
medium and petal length medium implies Versicolour. Petal width
high and petal length high implies Virginica.
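The three rules above can be sketched as a simple rule-based classifier. This is a minimal illustration, not a fitted model; the numeric cut-offs below are illustrative assumptions standing in for "low", "medium", and "high".

```python
def classify_iris(petal_length, petal_width):
    """Toy rule-based classifier for the three Iris species.

    The thresholds are illustrative assumptions (in cm); a real
    model would learn them from the data.
    """
    if petal_width < 0.75 and petal_length < 2.5:      # low / low
        return "Setosa"
    elif petal_width < 1.75 and petal_length < 4.75:   # medium / medium
        return "Versicolour"
    else:                                              # high / high
        return "Virginica"

print(classify_iris(1.4, 0.2))  # prints "Setosa"
```

A predictive model of this kind maps the explanatory variables (petal length and width) to the target variable (species).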
• Association analysis is used to discover patterns that describe
strongly associated features in the data.
• The discovered patterns are typically represented in the form of
rules.
• Cluster analysis seeks to find groups of closely related
observations so that observations that belong to the same
cluster are more similar to each other than observations that
belong to other clusters.
• Example (Document Clustering). The collection of news
articles shown in the table can be grouped based on their
respective topics.
• Each article is represented as a set of word-frequency pairs
(w, c), where w is a word and c is the number of times the
word appears in the article.
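The (w, c) representation described above can be built with a simple frequency count. The article text below is invented for illustration.

```python
from collections import Counter

def word_frequency_pairs(article_text):
    """Represent an article as a set of (word, count) pairs."""
    words = article_text.lower().split()
    return set(Counter(words).items())

# Illustrative one-line "news article".
article = "dollar rises as dollar gains against yen"
pairs = word_frequency_pairs(article)
print(pairs)  # contains ('dollar', 2) among the other pairs
```

Articles with similar sets of (w, c) pairs would then be placed in the same cluster.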
• Anomaly detection is the task of identifying observations
whose characteristics are significantly different from the rest
of the data.
• Such observations are known as anomalies or outliers.
• Credit Card Fraud Detection. A credit card company records
the transactions made by every credit card holder, along with
personal information such as credit limit, age, annual income,
and address.
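One very simple way to flag unusual transactions like those described above is a z-score test: a value far from the mean, measured in standard deviations, is reported as an outlier. The amounts and the threshold below are illustrative assumptions, not a production fraud model.

```python
import statistics

def find_outliers(values, threshold=2.5):
    """Flag values whose z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Typical purchase amounts with one suspiciously large transaction.
amounts = [25, 30, 22, 28, 35, 27, 31, 24, 29, 26, 900]
print(find_outliers(amounts))  # prints [900]
```

Real fraud detection systems combine many such signals with the cardholder's personal profile, but the principle of "significantly different from the rest" is the same.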
Types of Data
• A data set can be viewed as a collection of data objects.
• Data objects are described by a number of attributes that
capture the basic characteristics of an object.
• Other names for an attribute are variable, characteristic, field,
feature, or dimension.
Example (Student Information).
• Often, a data set is a file in which the objects are records (or
rows) and each field (or column) corresponds to an attribute.
Attributes and Measurement
• An attribute is a property or characteristic of an object that may
vary, either from one object to another or from one time to
another.
• For example, eye color varies from person to person, while the
temperature of an object varies over time.
• A measurement scale is a rule (function) that associates a
numerical value with an attribute of an object.
The Type of an Attribute
• The values used to represent an attribute may have properties
that are not properties of the attribute itself, and vice versa.
• For example, while it is reasonable to talk about the average age
of employees, it makes no sense to talk about the average employee ID.
The Different Types of Attributes
• The following properties (operations) of numbers are typically
used to describe attributes: distinctness (= and ≠), order (<, ≤,
>, and ≥), addition (+ and −), and multiplication (× and ÷).
• Given these properties, we can define four types of attributes:
nominal, ordinal, interval, and ratio.
                Attribute Type   Description                       Examples               Operations
  Categorical   Nominal          The values of a nominal           zip codes,             mode
  (Qualitative)                  attribute are just different      employee ID
                                 names.                            numbers
                Ordinal          The values of an ordinal          {good, better, best},  median
                                 attribute provide enough          grades
                                 information to order objects.
  Numeric       Interval         The differences between           calendar dates         mean
  (Quantitative)                 values are meaningful.
                Ratio            Both differences and ratios       counts, age,           mean
                                 are meaningful.                   length
Transformations that define attribute levels
Describing Attributes by the Number of Values
• Discrete: A discrete attribute has a finite or countably infinite
set of values.
• Such attributes can be categorical, such as zip codes or ID
numbers.
• Binary attributes are discrete attributes and assume only two
values, e.g., true/false, yes/no, or 0/1.
• Continuous: A continuous attribute is one whose values are real
numbers.
• Examples include attributes such as temperature, height, or
weight.
Types of Data Sets
• There are many types of data sets; here we group them into three
categories: record data, graph-based data, and ordered data.
• General Characteristics of Data Sets are dimensionality, sparsity,
and resolution.
• Dimensionality: The dimensionality of a data set is the number
of attributes in the data set.
• The difficulties associated with analyzing high-dimensional data
are referred to as the curse of dimensionality.
• Sparsity: For some data sets, such as those with asymmetric
features, most attributes of an object have values of 0.
• Resolution: It is possible to obtain data at different levels of
resolution.
• For instance, the surface of the Earth seems very uneven at a
resolution of a few meters, but is relatively smooth at a resolution
of kilometers.
• Record Data: Much data mining work assumes that the data
set is a collection of records, each of which consists of a fixed
set of attributes.
Transaction or Market Basket Data
• Transaction data is a special type of record data, where each record
(transaction) involves a set of items.
• Consider a grocery store: the products purchased by a customer
during one shopping trip constitute a transaction.
Transaction data.
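Transaction data of this kind is naturally represented as a list of item sets. The sketch below uses illustrative grocery items and also computes the support of an itemset (the fraction of transactions containing all of its items), the basic quantity behind association analysis.

```python
# Each transaction is the set of items bought in one shopping trip
# (illustrative data).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"diapers", "beer"}, transactions))  # prints 0.6
```

An association rule such as {diapers} → {beer} would be considered interesting if this support, together with the rule's confidence, exceeds user-chosen thresholds.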
• The Data Matrix: If the data objects all have the same fixed set
of numeric attributes, the data set can be represented as a matrix
in which each row is an object and each column is an attribute.
• The Sparse Data Matrix is a special case of a data matrix in
which the attributes are of the same type and are asymmetric;
i.e., only non-zero values are important.
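Because only the non-zero values matter, a sparse data matrix is usually stored by recording just those entries. A minimal dictionary-of-dictionaries sketch, assuming rows are documents and columns are word indices (the data is illustrative):

```python
# Store only the non-zero entries: {row: {column: value}}.
sparse = {
    0: {3: 2, 7: 1},   # document 0: word 3 occurs twice, word 7 once
    1: {1: 5},         # document 1: word 1 occurs five times
}

def get(matrix, row, col):
    """Return the stored value, or 0 for an absent (zero) entry."""
    return matrix.get(row, {}).get(col, 0)

print(get(sparse, 0, 3))  # prints 2
print(get(sparse, 1, 3))  # prints 0 (entry not stored)
```

Production systems use dedicated sparse formats (e.g., compressed rows), but the idea is the same: absent entries are implicitly zero.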
Graph-Based Data
• A graph is a powerful representation for data.
• We consider two specific cases:
(1) the graph captures relationships among data objects, and
(2) the data objects themselves are represented as graphs.
Linked Web pages.
• Data with Objects that Are Graphs: If objects have structure, i.e.,
the objects contain sub-objects that have relationships, then such
objects are represented as graphs.
• Ordered Data: For some types of data, the attributes have
relationships that involve order in time or space.
• Sequential Data, also referred to as temporal data, can be an
extension of record data, where each record has a time associated
with it.
• Sequence Data consists of a data set that is a sequence of
individual entities, such as a sequence of words or letters.

Genomic sequence data


• Time Series Data is a special type of sequential data in which
each record is a time series, i.e., a series of measurements taken
over time.

Temperature time series


• Spatial Data: Some objects have spatial attributes, such as
positions or areas.
• An example of spatial data is weather data (temperature,
pressure) that is collected for a variety of geographical
locations.
Spatial temperature data
Data Quality
• Data quality is the measure of how well a data set serves its
specific purpose.
• The focus is on measurement and data collection issues.
Measurement and Data Collection Errors
• The term measurement error refers to any problem resulting from
the measurement process.
• A common problem is that the value recorded differs from the true
value to some extent.
• For continuous attributes, the numerical difference between the
measured value and the true value is called the error.
• The term data collection error refers to errors such as
omitting data objects or attribute values, or inappropriately
including a data object.
Noise and Artifacts
• Noise is the random component of a measurement error. It may
involve the distortion of a value or the addition of spurious
objects.
• Data errors may also be deterministic, such as a streak in the
same place on a set of photographs.
• Such deterministic distortions of the data are referred to as
artifacts.
• In statistics, the quality of the measurement process and the
resulting data are measured by precision and bias.
• Precision:: The closeness of repeated measurements (of the same
quantity) to one another.
• Bias:: A systematic variation of measurements from the quantity
being measured.
• Accuracy:: The closeness of measurements to the true value of the
quantity being measured.
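The three notions can be illustrated numerically. Suppose the true value of a quantity is 1.000 (the true value and the five repeated readings below are invented for illustration): precision is the spread of the repeated readings, and bias is their systematic offset from the true value.

```python
import statistics

true_value = 1.000                                  # illustrative
readings = [1.015, 1.012, 1.014, 1.016, 1.013]      # repeated measurements

bias = statistics.mean(readings) - true_value       # systematic offset
precision = statistics.stdev(readings)              # spread of the repeats

print(round(bias, 3))       # prints 0.014
print(round(precision, 4))  # small spread: precise but biased
```

Here the measurements are precise (they agree closely with one another) yet inaccurate, because the bias of about +0.014 shifts them all away from the true value.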
Outliers
• Outliers are either (1) data objects that have characteristics
different from the other data objects in the data set, or
(2) values of an attribute that are unusual with respect to the
typical values for that attribute.
Missing Values
• It is not unusual for an object to be missing one or more attribute values.
• In some cases, the information was not collected; e.g., some people
decline to give their age or weight.
There are several strategies for dealing with missing data
• Eliminate Data Objects or Attributes: A simple and effective
strategy is to eliminate objects with missing values.
• Estimate Missing Values: Sometimes missing data can be reliably
estimated.
• For example, consider a time series that changes in a reasonably
smooth fashion, but has a few, widely scattered missing values.
• In such cases, the missing values can be estimated by using the
remaining values.
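For a smoothly varying time series, an isolated missing value can be estimated by interpolating between its neighbours. A minimal sketch, using None to mark missing readings (the temperatures are illustrative):

```python
def fill_missing(series):
    """Replace an interior None with the average of its two neighbours.

    Assumes missing values are isolated (never two in a row) and do
    not occur at either end of the series.
    """
    filled = list(series)
    for i, v in enumerate(filled):
        if v is None:
            filled[i] = (filled[i - 1] + filled[i + 1]) / 2
    return filled

temps = [20.0, 21.0, None, 23.0, 24.0]
print(fill_missing(temps))  # prints [20.0, 21.0, 22.0, 23.0, 24.0]
```

More sophisticated estimates use a larger neighbourhood or, for non-temporal data, the values of the most similar objects.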
• Ignore the Missing Value during Analysis: Many data mining
approaches can be modified to ignore missing values.
• For example, suppose that objects are being clustered and the
similarity between pairs of data objects needs to be calculated.
Inconsistent Values
• Data can contain inconsistent values. Consider an address field,
where both a zip code and city are listed, but the specified zip code
area is not contained in that city.
Duplicate Data
• A data set may include data objects that are duplicates, or almost
duplicates, of one another.
• Many people receive duplicate mailings because they appear in a
database multiple times under slightly different names.
Issues Related to Applications
• Data quality issues can also be considered from an application
viewpoint, as expressed by the statement “data is of high quality
if it is suitable for its intended use.”
• Timeliness: Some data starts to age as soon as it has been collected.
• Relevance: The available data must contain the information
necessary for the application.
• Consider the task of building a model that predicts the accident
rate for drivers.
• Knowledge about the Data: Ideally, data sets should contain
documentation that describes different aspects of the data.


• Thank you
