Unit-1: Part-1:
Data Mining:
Data mining is a technology that blends traditional data analysis
methods with sophisticated algorithms for processing large volumes
of data.
Data mining is the process of automatically discovering useful
information in large data repositories.
Data mining techniques are deployed to scour large databases in
order to find novel and useful patterns that might otherwise remain
unknown. They also provide the capability to predict the outcome of a
future observation.
However, extracting useful information has proven extremely
challenging. Often, traditional data analysis tools and techniques
cannot be used because of the massive size of a data set.
Sometimes, the non-traditional nature of the data means that
traditional approaches cannot be applied even if the data set is
relatively small.
Data Mining and Knowledge Discovery:
Data mining is an integral part of knowledge discovery in databases
(KDD), which is the overall process of converting raw data into useful
information.
This process consists of a series of transformation steps, from data
preprocessing to postprocessing of data mining results.
The input data can be stored in a variety of formats (flat files,
spreadsheets, or relational tables) and may reside in a centralized
data repository or be distributed across multiple sites.
The purpose of pre-processing is to transform the raw input data into
an appropriate format for subsequent analysis. The steps involved in
data preprocessing include fusing data from multiple sources,
cleaning data to remove noise and duplicate observations, and
selecting records and features that are relevant to the data mining
task at hand.
Because of the many ways data can be collected and stored, data
preprocessing is perhaps the most laborious and time-consuming
step in the overall knowledge discovery process.
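As a rough illustration of these preprocessing steps, here is a minimal pandas sketch that fuses data from two sources, cleans out duplicates and obviously noisy records, and selects the relevant features. The file names and column names (customers.csv, orders.csv, customer_id, age, total_spent) are hypothetical placeholders, not part of the original notes.

# A minimal sketch of the preprocessing steps described above, using pandas.
# The file and column names are hypothetical placeholders.
import pandas as pd

# Fuse data from multiple sources.
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")
data = customers.merge(orders, on="customer_id", how="inner")

# Clean data: remove duplicate observations and rows with obvious noise.
data = data.drop_duplicates()
data = data[data["age"].between(0, 120)]   # discard implausible ages

# Select only the records and features relevant to the mining task.
data = data[["customer_id", "age", "total_spent"]]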
Motivating Challenges:
1) Scalability:
Because of advances in data generation and collection, data sets with
sizes of gigabytes, terabytes, or even petabytes are becoming
common. If data mining algorithms are to handle these massive data
sets, then they must be scalable. Scalability can also be improved by
using sampling or developing parallel and distributed algorithms.
2) High Dimensionality:
It is now common to encounter data sets with hundreds or
thousands of attributes instead of the handful common a few
decades ago. Data sets with temporal or spatial components also
tend to have high dimensionality.
For example, consider a data set that contains measurements of
temperature at various locations. If the temperature measurements
are taken repeatedly for an extended period, the number of
dimensions (features) increases.
Traditional data analysis techniques that were developed for low-
dimensional data often do not work well for such high-dimensional
data.
3) Heterogeneous and Complex Data:
Traditional data analysis methods often deal with data sets
containing attributes of the same type, either continuous or
categorical. As the role of data mining in business, science, medicine,
and other fields has grown, so has the need for techniques that can
handle heterogeneous attributes.
Examples of such non-traditional types of data include collections of
Web pages containing semi-structured text and hyperlinks, DNA data,
and climate data that consists of time series measurements.
4) Data Ownership and Distribution:
Sometimes, the data needed for an analysis is not stored in one
location or owned by one organization. Instead, the data is
geographically distributed among resources belonging to multiple
entities. This requires the development of distributed data mining
techniques.
5) Non-traditional Analysis:
The traditional statistical approach is based on a hypothesize-and-
test paradigm. In other words, a hypothesis is proposed, an
experiment is designed to gather the data, and then the data is
analyzed with respect to the hypothesis. Unfortunately, this process
is extremely labor-intensive.
Current data analysis tasks often require the generation and
evaluation of thousands of hypotheses, and consequently, the
development of some data mining techniques has been motivated by
the desire to automate the process of hypothesis generation and
evaluation.
The origins of Data Mining:
Motivated by the goal of meeting these challenges, researchers from
different disciplines began to focus on developing more efficient and
scalable tools that could handle diverse types of data.
This work, which culminated in the field of data mining, built
upon the methodology and algorithms that researchers had
previously used. In particular, data mining draws upon ideas such as
sampling, estimation, and hypothesis testing from statistics, and upon
search algorithms, modeling techniques, and learning theories from
artificial intelligence, pattern recognition, and machine learning.
Data mining has also been quick to adopt ideas from other
areas, including optimization, evolutionary computing, information
theory, signal processing, visualization, and information retrieval. A
number of other areas also play key supporting roles.
In particular, database systems are needed to provide support
for efficient storage, indexing, and query processing.
Techniques from high performance (parallel) computing are
often important in addressing the massive size of some data sets.
Distributed techniques can also help address the issue of size
and are essential when the data cannot be gathered in one location.
Data Mining Tasks:
Data mining tasks are generally divided into two major categories:
1) Predictive tasks:
The objective of these tasks is to predict the value of a
particular attribute based on the values of other attributes. The
attribute to be predicted is commonly known as the target or
dependent variable, while the attributes used for making the
prediction are known as the explanatory or independent variables.
2) Descriptive tasks:
Here, the objective is to derive patterns (correlations, trends,
clusters, trajectories, and anomalies) that summarize the underlying
relationships in data. Descriptive data mining tasks are often
exploratory in nature and frequently require postprocessing
techniques to validate and explain the results.
Predictive modeling:
It refers to the task of building a model for the target variable
as a function of the explanatory variables. There are two types of
predictive modeling tasks: classification, which is used for discrete
target variables, and regression, which is used for continuous target
variables. Ex: Predicting the Type of a Flower
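A minimal sketch of a classification task, predicting the type (species) of a flower from its measurements using scikit-learn's bundled Iris data. The choice of a decision tree and the 70/30 split are illustrative assumptions.

# Minimal classification sketch: predict the type (species) of a flower
# from its measurements, using the Iris data bundled with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)          # explanatory vs. target variables
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))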
Association analysis:
It is used to discover patterns that describe strongly associated
features in the data. The discovered patterns are typically
represented in the form of implication rules or feature subsets. Ex:
Market Basket Analysis
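A minimal sketch of association analysis on a toy set of market-basket transactions; the items and the candidate rule {bread, butter} -> {milk} are invented for illustration.

# Market-basket sketch: compute support and confidence for a candidate rule
# {bread, butter} -> {milk} over a toy set of transactions (hypothetical data).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"bread", "butter"}, {"milk"}
sup = support(antecedent | consequent, transactions)
conf = sup / support(antecedent, transactions)
print(f"support={sup:.2f}, confidence={conf:.2f}")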
Cluster analysis:
It seeks to find groups of closely related observations, so that
observations that belong to the same cluster are more similar to
each other than to observations in other clusters. Ex: Document Clustering
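A minimal document-clustering sketch: a few toy documents are turned into TF-IDF term vectors and grouped with k-means. The documents and the choice of k = 2 are assumptions for illustration.

# Document-clustering sketch: group a few toy documents with k-means over
# TF-IDF term vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stock market prices fell sharply",
    "investors worry about market volatility",
    "the team won the football match",
    "a late goal decided the match",
]
X = TfidfVectorizer().fit_transform(docs)      # documents as term vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # documents in the same cluster are more similar to each other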
Anomaly detection:
It is the task of identifying observations whose characteristics
are significantly different from the rest of the data. Such
observations are known as anomalies or outliers. The goal of an
anomaly detection algorithm is to discover the real anomalies and
avoid falsely labeling normal objects as anomalous. Ex: Credit Card
Fraud Detection
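A minimal anomaly-detection sketch using a simple z-score rule on toy transaction amounts; the data and the threshold of 2 (lowered because the sample is tiny) are assumptions, not a prescribed method.

# Anomaly-detection sketch: flag observations whose values differ markedly
# from the rest of the data using a simple z-score rule.
import numpy as np

amounts = np.array([25.0, 31.5, 28.2, 30.1, 27.8, 950.0, 26.4])  # toy data
z = (amounts - amounts.mean()) / amounts.std()
print(amounts[np.abs(z) > 2])   # -> [950.], the suspicious transaction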
Unit-1: Part-2:
What is data:
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
• Examples: eye color of a person, temperature, etc.
• Attribute is also known as variable, field, characteristic,
dimension, or feature
• A collection of attributes describe an object
• Object is also known as record, point, case, sample, entity,
or instance.
Attribute values:
• Attribute values are numbers or symbols assigned to an
attribute for a particular object
• Distinction between attributes and attribute values
“I see our purchases are very similar since we didn’t buy most of
the same things.”
Characteristics of data:
• Dimensionality (number of attributes)
• High dimensional data brings a number of challenges
• Sparsity
• Only presence counts
• Resolution
• Patterns depend on the scale
• Size
• Type of analysis may depend on size of data
Types of datasets:
• Record
  • Data Matrix
  • Document Data
  • Transaction Data
• Graph
  • World Wide Web
  • Molecular Structures
• Ordered
  • Spatial Data
  • Temporal Data
  • Sequential Data
  • Genetic Sequence Data
Record data:
• Data that consists of a collection of records, each of which
consists of a fixed set of attributes
Data matrix:
• If data objects have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a multi-
dimensional space, where each dimension represents a distinct
attribute
• Such a data set can be represented by an m by n matrix, where
there are m rows, one for each object, and n columns, one for
each attribute
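A minimal sketch of a data matrix with m = 3 objects and n = 2 numeric attributes, using NumPy; the values are toy numbers.

# Data-matrix sketch: 3 objects described by 2 numeric attributes, stored as
# a 3-by-2 NumPy array (rows = objects, columns = attributes).
import numpy as np

data_matrix = np.array([
    [1.5, 2.7],   # object 1
    [3.1, 0.8],   # object 2
    [2.2, 1.9],   # object 3
])
print(data_matrix.shape)   # (3, 2): m rows, n columns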
Document data:
• Each document becomes a ‘term’ vector
• Each term is a component (attribute) of the vector
• The value of each component is the number of times the
corresponding term occurs in the document.
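A minimal sketch of turning documents into term vectors using only the Python standard library; the two example sentences are invented.

# Document-data sketch: each document becomes a term vector whose components
# count how often the corresponding term occurs in the document.
from collections import Counter

docs = ["the cat sat on the mat", "the dog chased the cat"]
vocabulary = sorted({term for d in docs for term in d.split()})

term_vectors = [
    [Counter(d.split())[term] for term in vocabulary] for d in docs
]
print(vocabulary)
print(term_vectors)   # one row of term counts per document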
Transaction data:
• A special type of data, where
• Each transaction involves a set of items.
• For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip
constitute a transaction, while the individual products that
were purchased are the items.
• Can represent transaction data as record data
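A minimal sketch of representing transaction data as record data, with one binary column per item; the shopping trips shown are toy examples.

# Transaction-data sketch: represent each shopping trip as a binary record,
# with one column per item (1 = purchased, 0 = not purchased).
import pandas as pd

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
]
items = sorted(set().union(*transactions))
record_data = pd.DataFrame(
    [[int(item in t) for item in items] for t in transactions], columns=items
)
print(record_data)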
Graph data:
• Data with relationships among objects, where the objects are mapped
to nodes and the relationships among objects are captured by links.
Ex: the World Wide Web, molecular structures
Ordered data:
• Data whose attributes have relationships involving order in time or
space. Ex: spatial, temporal, sequential, and genetic sequence data
Data Quality:
Data quality is determined by many characteristics, such as
completeness and consistency; incomplete and inconsistent
information are common properties of large real-world databases.
Because preventing data quality problems is typically not an option,
data mining focuses on (1) the detection and correction of data quality
problems and (2) the use of algorithms that can tolerate poor data
quality. The first step, detection and correction, is often called data
cleaning.
Measurement and Data Collection Issues:
Measurement and Data Collection Errors:
The term measurement error refers to any problem resulting
from the measurement process. A common problem is that the value
recorded differs from the true value to some extent. The term data
collection error refers to errors such as omitting data objects or
attribute values, or inappropriately including a data object.
Noise and Artifacts:
Noise is the random component of a measurement error. It may
involve the distortion of a value or the addition of spurious objects.
Deterministic distortions of the data, such as a streak in the same
place on a set of photographs, are known as artifacts.
Precision, Bias, and Accuracy:
Precision: The closeness of repeated measurements of the
same quantity to one another.
Bias: A systematic variation of measurements from the
quantity being measured.
Accuracy: The closeness of measurements to the true value of
the quantity being measured.
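A small numeric sketch of precision and bias for repeated measurements of a quantity whose true value is assumed to be 1.000 g; the five measurement values are illustrative.

# Precision and bias sketch: a quantity with true value 1.000 g is measured
# five times (toy numbers).
import numpy as np

true_value = 1.000
measurements = np.array([1.015, 0.990, 1.013, 1.001, 0.986])

precision = measurements.std()              # spread of repeated measurements
bias = measurements.mean() - true_value     # systematic deviation from truth
print(f"precision (std) = {precision:.4f}, bias = {bias:+.4f}")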
Outliers:
Outliers are either data objects that, in some sense, have
characteristics that are different from most of the other data objects
in the data set, or values of an attribute that are unusual with
respect to the typical values for that attribute.
Missing Values:
It is not unusual for an object to be missing one or more
attribute values. In some cases, the information was not
collected. The techniques used for dealing with missing data include:
Eliminate Data Objects or Attributes:
A simple and effective strategy is to eliminate objects with
missing values.
Estimate Missing Values:
Sometimes missing data can be reliably estimated; the missing
values can be estimated by using the remaining values (for example,
by interpolation for a time series, or by using the values of the most
similar objects).
Ignore the Missing Value during Analysis:
Many data mining approaches can be modified to ignore
missing values. If one or both objects of a pair have missing values
for some attributes, then the similarity can be calculated by using
only the attributes that do not have missing values.
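A minimal pandas sketch of the three strategies above; the table and its column names are hypothetical.

# Sketch of the three strategies for handling missing values with pandas.
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 33], "income": [50.0, 61.5, None, 47.0]})

# 1) Eliminate data objects (rows) that have any missing value.
dropped = df.dropna()

# 2) Estimate missing values from the remaining values (here: column means).
imputed = df.fillna(df.mean())

# 3) Ignore missing values during analysis, e.g. pairwise-complete correlation.
corr = df.corr()   # pandas excludes missing pairs when computing correlations
print(dropped, imputed, corr, sep="\n\n")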
Inconsistent Values:
Data can contain inconsistent values. Some types of
inconsistencies are easy to detect and some are not. Once an
inconsistency has been detected, it is sometimes possible to correct
the data.
Duplicate Data:
A data set may include data objects that are duplicates, or
almost duplicates, of one another.
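A minimal pandas sketch of detecting and removing duplicate objects; the records are invented.

# Duplicate-data sketch: detect and remove repeated data objects.
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "city": ["Pune", "Delhi", "Pune"]})
print(df.duplicated())        # marks the repeated record
print(df.drop_duplicates())   # keeps one copy of each object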
Data Preprocessing:
Data preprocessing is a broad area and consists of a number of
different strategies and techniques that are interrelated in complex
ways. The purpose of pre-processing is to transform the raw input
data into an appropriate format for subsequent analysis.
• Aggregation
• Sampling
• Dimensionality reduction
• Feature subset selection
• Feature creation
• Discretization and binarization
• Attribute transformation
Aggregation:
Combining two or more attributes (or objects) into a single
attribute (or object)
Purpose:
• Data reduction - reduce the number of attributes or
objects
• Change of scale
• Cities aggregated into regions, states, countries, etc.
• Days aggregated into weeks, months, or years
• More “stable” data - aggregated data tends to have less
variability.
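A minimal aggregation sketch: hypothetical daily sales records are aggregated into monthly totals, reducing the number of objects and changing the scale.

# Aggregation sketch: combine daily sales records into monthly totals.
import pandas as pd

daily = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-18"]),
    "sales": [120.0, 95.5, 130.0, 110.0],
})
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)   # aggregated data tends to have less variability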
Sample size:
Choosing the proper sample size involves a trade-off: larger samples
increase the probability that the sample will be representative of the
full data set, while smaller samples reduce the cost of processing.
Feature Creation:
• Create new attributes that can capture the important
information in a data set much more efficiently than the original
attributes
• Three general methodologies:
• Feature extraction
• Example: extracting edges from images
• Feature construction
• Example: dividing mass by volume to get density
• Mapping data to new space
Example: Fourier and wavelet analysis
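A minimal feature-construction sketch along the lines of the density example above; the mass and volume values are toy numbers.

# Feature-construction sketch: derive a density attribute as mass / volume.
import pandas as pd

objects = pd.DataFrame({"mass": [10.0, 4.0, 7.5], "volume": [2.0, 1.0, 3.0]})
objects["density"] = objects["mass"] / objects["volume"]
print(objects)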
Discretization and binarization:
Discretization:
Discretization is the process of converting a continuous attribute
into an ordinal attribute.
• A potentially infinite number of values are mapped into a
small number of categories
• Discretization is used in both unsupervised and supervised
settings
• Unsupervised Discretization: does not use class labels; common
approaches are equal-width and equal-frequency binning. (The figure
in the original notes shows data consisting of four groups of points
and two outliers; the data is one-dimensional, but a random y
component is added to reduce overlap.)
Supervised Discretization:
Uses class labels to choose the category boundaries, e.g., entropy-
based approaches.
Binarization:
It maps a continuous or categorical attribute into one or more
binary variables.
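A minimal pandas sketch of discretization (equal-width binning of a continuous attribute) and binarization (one binary variable per category); the ages, bin labels, and colors are assumptions.

# Discretization sketch: continuous -> ordinal via equal-width bins.
# Binarization sketch: categorical -> binary indicator variables.
import pandas as pd

ages = pd.Series([5, 17, 23, 41, 68])
age_group = pd.cut(ages, bins=3, labels=["young", "middle", "old"])  # discretize
print(age_group)

colors = pd.Series(["red", "green", "red", "blue"])
print(pd.get_dummies(colors))   # one binary variable per category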
Attribute transform:
• An attribute transform is a function that maps the entire set of
values of a given attribute to a new set of replacement values
such that each old value can be identified with one of the new
values
• Simple functions: x^k, log(x), e^x, |x|
• Normalization
• Refers to various techniques to adjust to differences
among attributes in terms of frequency of
occurrence, mean, variance, range
• Take out unwanted, common signal, e.g.,
seasonality
• In statistics, standardization refers to subtracting off the
means and dividing by the standard deviation
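A minimal NumPy sketch of a simple function transform and of standardization (subtracting the mean and dividing by the standard deviation); the values are toy numbers.

# Attribute-transformation sketch: a simple function transform and
# standardization to mean 0 and standard deviation 1.
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

log_x = np.log(x)                        # simple function transform
standardized = (x - x.mean()) / x.std()  # subtract mean, divide by std
print(standardized.mean().round(6), standardized.std().round(6))  # ~0 and 1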
Unit-1: Part-3:
MEASURES OF SIMILARITY AND DISSIMILARITY:
The similarity between two objects is a numerical measure of
the degree to which the two objects are alike. Similarities are higher
for pairs of objects that are more alike. Similarities are usually non-
negative and are often between 0 (no similarity) and 1 (complete
similarity).
The dissimilarity between two objects is a numerical measure
of the degree to which the two objects are different. Dissimilarities
are lower for more similar pairs of objects. Dissimilarities sometimes
fall in the interval [0, 1], but it is also common for them to range from
0 to ∞.
Similarity and Dissimilarity between Simple Attributes:
The proximity of objects with a number of attributes is typically
defined by combining the proximities of individual attributes, and
thus, we first discuss proximity between objects having a single
attribute.
Ex: For a single nominal attribute, similarity is 1 if the two values
match and 0 otherwise; for an ordinal attribute whose values are
mapped to the integers 0 to n-1, dissimilarity can be defined as
d = |x - y|/(n - 1); for an interval or ratio attribute, d = |x - y|.
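A minimal sketch of these single-attribute proximity definitions; the helper functions and example values are invented for illustration.

# Proximity for a single attribute, under the usual textbook conventions
# (ordinal values mapped to the integers 0..n-1).
def nominal_similarity(x, y):
    return 1 if x == y else 0

def ordinal_dissimilarity(x, y, n_values):
    # x and y are integer ranks in 0..n_values-1
    return abs(x - y) / (n_values - 1)

def interval_dissimilarity(x, y):
    return abs(x - y)

print(nominal_similarity("red", "blue"))        # 0
print(ordinal_dissimilarity(0, 2, n_values=4))  # 0.666...
print(interval_dissimilarity(36.6, 39.1))       # 2.5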