
Unit-1: Part-1:

Data Mining:
Data mining is a technology that blends traditional data analysis
methods with sophisticated algorithms for processing large volumes
of data.
Data mining is the process of automatically discovering useful
information in large data repositories.
Data mining techniques are deployed to scour large databases in
order to find novel and useful patterns that might otherwise remain
unknown. They also provide the capability to predict the outcome of future observations.
However, extracting useful information has proven extremely
challenging. Often, traditional data analysis tools and techniques
cannot be used because of the massive size of a data set.
Sometimes, the non-traditional nature of the data means that
traditional approaches cannot be applied even if the data set is
relatively small.
Data Mining and Knowledge Discovery:
Data mining is an integral part of knowledge discovery in databases
(KDD), which is the overall process of converting raw data into useful
information.
This process consists of a series of transformation steps, from data
preprocessing to postprocessing of data mining results.
The input data can be stored in a variety of formats (flat files,
spreadsheets, or relational tables) and may reside in a centralized
data repository or be distributed across multiple sites.
The purpose of pre-processing is to transform the raw input data into
an appropriate format for subsequent analysis. The steps involved in
data preprocessing include fusing data from multiple sources,
cleaning data to remove noise and duplicate observations, and
selecting records and features that are relevant to the data mining
task at hand.
Because of the many ways data can be collected and stored, data
preprocessing is perhaps the most laborious and time-consuming
step in the overall knowledge discovery process.
Motivating Challenges:
1)Scalability:
Because of advances in data generation and collection, data sets with
sizes of gigabytes, terabytes, or even petabytes are becoming
common. If data mining algorithms are to handle these massive data
sets, then they must be scalable. Scalability can also be improved by
using sampling or developing parallel and distributed algorithms.
2) High Dimensionality:
It is now common to encounter data sets with hundreds or
thousands of attributes instead of the handful common a few
decades ago. Data sets with temporal or spatial components also
tend to have high dimensionality.
For example, consider a data set that contains measurements of
temperature at various locations. If the temperature measurements
are taken repeatedly for an extended period, the number of
dimensions (features) increases.
Traditional data analysis techniques that were developed for low-dimensional data often do not work well for such high-dimensional data.
3) Heterogeneous and Complex Data:
Traditional data analysis methods often deal with data sets
containing attributes of the same type, either continuous or
categorical. As the role of data mining in business, science, medicine,
and other fields has grown, so has the need for techniques that can
handle heterogeneous attributes.
Examples of such non-traditional types of data include collections of Web pages containing semi-structured text and hyperlinks, DNA data, and climate data that consists of time-series measurements.
4) Data Ownership and Distribution:
Sometimes, the data needed for an analysis is not stored in one
location or owned by one organization. Instead, the data is
geographically distributed among resources belonging to multiple
entities. This requires the development of distributed data mining
techniques.
5) Non-traditional Analysis:
The traditional statistical approach is based on a hypothesize-and-test paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis. Unfortunately, this process is extremely labor-intensive.
Current data analysis tasks often require the generation and
evaluation of thousands of hypotheses, and consequently, the
development of some data mining techniques has been motivated by
the desire to automate the process of hypothesis generation and
evaluation.
The origins of Data Mining:
Motivated by the goal of meeting these challenges, researchers from different disciplines began to focus on developing more efficient and scalable tools that could handle diverse types of data.
This work, which culminated in the field of data mining, built
upon the methodology and algorithms that researchers had
previously used. In particular, data mining draws upon ideas, such as
sampling, estimation, and hypothesis testing from statistics and
search algorithms, modeling techniques, and learning theories from
artificial intelligence, pattern recognition, and machine learning.
Data mining has also been quick to adopt ideas from other
areas, including optimization, evolutionary computing, information
theory, signal processing, visualization, and information retrieval. A
number of other areas also play key supporting roles.
In particular, database systems are needed to provide support
for efficient storage, indexing, and query processing.
Techniques from high performance (parallel) computing are
often important in addressing the massive size of some data sets.
Distributed techniques can also help address the issue of size
and are essential when the data cannot be gathered in one location.
Data Mining Tasks:
Data mining tasks are generally divided into two major categories:
1)Predictive tasks:
The objective of these tasks is to predict the value of a
particular attribute based on the values of other attributes. The
attribute to be predicted is commonly known as the target or
dependent variable, while the attributes used for making the
prediction are known as the explanatory or independent variables.
2)Descriptive tasks:
Here, the objective is to derive patterns (correlations, trends,
clusters, trajectories, and anomalies) that summarize the underlying
relationships in data. Descriptive data mining tasks are often
exploratory in nature and frequently require postprocessing
techniques to validate and explain the results.
Predictive modeling:
It refers to the task of building a model for the target variable
as a function of the explanatory variables. There are two types of
predictive modeling tasks: classification, which is used for discrete
target variables, and regression, which is used for continuous target
variables. Ex: Predicting the Type of a Flower
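As an illustration, the following is a minimal classification sketch in Python, assuming scikit-learn is available; the iris data set stands in for the flower example, and the choice of a decision tree model is arbitrary.

# Minimal classification sketch: predict a discrete target (flower species)
# from explanatory attributes. scikit-learn is assumed to be installed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)          # explanatory variables, target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

For a regression task (a continuous target variable), the same pattern applies with a regressor in place of the classifier.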
Association analysis:
It is used to discover patterns that describe strongly associated
features in the data. The discovered patterns are typically
represented in the form of implication rules or feature subsets. Ex: Market Basket Analysis
Cluster analysis:
It seeks to find groups of closely related observations so that
observations that belong to the same cluster are more similar to
each other. Ex: Document Clustering
Anomaly detection:
It is the task of identifying observations whose characteristics
are significantly different from the rest of the data. Such
observations are known as anomalies or outliers. The goal of an
anomaly detection algorithm is to discover the real anomalies and
avoid falsely labeling normal objects as anomalous. Ex: Credit Card
Fraud Detection
Unit-1: Part-2:
What is data:
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
• Examples: eye color of a person, temperature, etc.
• Attribute is also known as variable, field, characteristic,
dimension, or feature
• A collection of attributes describe an object
• Object is also known as record, point, case, sample, entity,
or instance.

Attribute values:
• Attribute values are numbers or symbols assigned to an
attribute for a particular object
• Distinction between attributes and attribute values

• Same attribute can be mapped to different attribute values


• Example: height can be measured in feet or meters
• Different attributes can be mapped to the same set of
values
• Example: Attribute values for ID and age are integers
• But the properties of an attribute can be different from the properties of the values used to represent it
Measurement of Length :
• The way you measure an attribute may not match the attribute's properties.
Types of Attributes :
• There are different types of attributes
• Nominal
• Examples: ID numbers, eye color, zip codes
• Ordinal
• Examples: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height {tall, medium, short}
• Interval
• Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
• Ratio
• Examples: temperature in Kelvin, length, counts,
elapsed time (e.g., time to run a race)
Properties of Attribute Values:
• The type of an attribute depends on which of the following
properties/operations it possesses:
• Distinctness: =, ≠
• Order: <, >
• Differences are meaningful: +, −
• Ratios are meaningful: *, /
• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & meaningful
differences
• Ratio attribute: all 4 properties/operations

Discrete and Continuous Attributes :


• Discrete Attribute

• Has only a finite or countably infinite set of values


• Examples: zip codes, counts, or the set of words in a
collection of documents
• Often represented as integer variables.
• Note: binary attributes are a special case of discrete
attributes
• Continuous Attribute
• Has real numbers as attribute values
• Examples: temperature, height, or weight.
• Practically, real values can only be measured and
represented using a finite number of digits.
• Continuous attributes are typically represented as floating-point variables.
Asymmetric attributes:
• Only presence (a non-zero attribute value) is regarded as
important
• Words present in documents
• Items present in customer transactions
• If we met a friend in the grocery store, would we ever say the following?

“I see our purchases are very similar since we didn’t buy most of
the same things.”
Characteristics of data:
• Dimensionality (number of attributes)
• High dimensional data brings a number of challenges
• Sparsity
• Only presence counts
• Resolution
• Patterns depend on the scale
• Size
• Type of analysis may depend on the size of the data

Types of datasets:
• Record
• Data Matrix
• Document Data
• Transaction Data
• Graph
• World Wide Web
• Molecular Structures
• Ordered
• Spatial Data
• Temporal Data
• Sequential Data
• Genetic Sequence Data
Record data:
• Data that consists of a collection of records, each of which
consists of a fixed set of attributes

Data matrix:
• If data objects have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a multi-
dimensional space, where each dimension represents a distinct
attribute
• Such a data set can be represented by an m by n matrix, where
there are m rows, one for each object, and n columns, one for
each attribute

Document data:
• Each document becomes a ‘term’ vector
• Each term is a component (attribute) of the vector
• The value of each component is the number of times the
corresponding term occurs in the document.
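A small sketch of building such term vectors in Python, assuming scikit-learn's CountVectorizer is available (the two example documents are made up):

# Sketch: turning documents into term-frequency vectors.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["data mining finds patterns in data",
        "cluster analysis groups similar objects"]

vectorizer = CountVectorizer()
term_matrix = vectorizer.fit_transform(docs)   # sparse document-term matrix

print(vectorizer.get_feature_names_out())      # the terms (attributes)
print(term_matrix.toarray())                   # term counts per document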

Transaction data:
• A special type of data, where
• Each transaction involves a set of items.
• For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip
constitute a transaction, while the individual products that
were purchased are the items.
• Can represent transaction data as record data
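A possible sketch of this representation with pandas (the item names are illustrative only): each transaction becomes a record with one asymmetric binary attribute per item.

# Sketch: representing transactions as record data with 0/1 item attributes.
import pandas as pd

transactions = [["bread", "milk"],
                ["bread", "diapers", "beer"],
                ["milk", "diapers", "beer", "cola"]]

# One row per transaction, one binary column per item.
records = pd.DataFrame([{item: 1 for item in t} for t in transactions])
records = records.fillna(0).astype(int)
print(records)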

Graph data:

Ordered data:
Data Quality:
Many characteristics, such as incompleteness and inconsistency, determine data quality; these problems are common in large real-world databases.
Because preventing data quality problems is typically not an option, data mining focuses on (1) the detection and correction of data quality problems and (2) the use of algorithms that can tolerate poor data quality. The first step, detection and correction, is often called data cleaning.
Measurement and Data Collection Issues:
Measurement and Data Collection Errors:
The term measurement error refers to any problem resulting from the measurement process. A common problem is that the value recorded differs from the true value to some extent. The term data collection error refers to errors such as omitting data objects or attribute values, or inappropriately including a data object.
Noise and Artifacts:
Noise is the random component of a measurement error. It may involve the distortion of a value or the addition of spurious objects.
Precision, Bias, and Accuracy:
Precision: The closeness of repeated measurements of the
same quantity to one another.
Bias: A systematic variation of measurements from the
quantity being measured.
Accuracy: The closeness of measurements to the true value of the quantity being measured.
Outliers:
Outliers are either data objects that, in some sense, have characteristics that are different from most of the other data objects in the data set, or values of an attribute that are unusual with respect to the typical values for that attribute.

Missing Values:
It is not unusual for an object to be missing one or more attribute values. In some cases, the information was not collected. The techniques used for dealing with missing data include the following:
Eliminate Data Objects or Attributes:
A simple and effective strategy is to eliminate objects with
missing values.
Estimate Missing Values:
Sometimes missing data can be reliably estimated; for example, the missing values can be estimated by using the remaining values.
Ignore the Missing Value during Analysis:
Many data mining approaches can be modified to ignore missing values. If one or both objects of a pair have missing values for some attributes, then the similarity can be calculated by using only the attributes that do not have missing values.
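The three strategies can be sketched with pandas as follows (the column names and values are made up for illustration):

# Sketch: handling missing values in three different ways.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 47, 31],
                   "income": [50000, 62000, np.nan, 45000]})

dropped   = df.dropna()            # eliminate data objects with missing values
estimated = df.fillna(df.mean())   # estimate missing values (here: the column mean)
corr      = df.corr()              # some analyses, e.g. correlation, simply skip missing pairs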
Inconsistent Values:
Data can contain inconsistent values. Some types of
inconsistences are easy to detect and some are not. Once an
inconsistency has been detected, it is sometimes possible to correct
the data.
Duplicate Data:
A data set may include data objects that are duplicates, or
almost duplicates, of one another.
Data Preprocessing:
Data preprocessing is a broad area and consists of a number of
different strategies and techniques that are interrelated in complex
ways. The purpose of pre-processing is to transform the raw input
data into an appropriate format for subsequent analysis.
• Aggregation
• Sampling
• Dimensionality reduction
• Feature subset selection
• Feature creation
• Discretization and binarization
• Attribute transformation
Aggregation:
Combining two or more attributes (or objects) into a single
attribute (or object)
Purpose:
• Data reduction - reduce the number of attributes or
objects
• Change of scale
• Cities aggregated into regions, states, countries, etc.
• Days aggregated into weeks, months, or years
• More “stable” data - aggregated data tends to have less
variability.

(Figures omitted: standard deviation of average monthly precipitation and standard deviation of average yearly precipitation, illustrating the lower variability of aggregated data.)
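For example, daily measurements can be aggregated into monthly values; a rough pandas sketch (with made-up precipitation values) follows.

# Sketch: aggregating daily precipitation into monthly totals (data reduction
# plus a change of scale from days to months).
import pandas as pd

daily = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=90, freq="D"),
    "precip_mm": 1.5,                                  # placeholder values
})

monthly = daily.groupby(daily["date"].dt.to_period("M"))["precip_mm"].sum()
print(monthly)   # fewer, more stable aggregated values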
Sampling:
Sampling is the main technique employed for data reduction. It is often used for both the preliminary investigation of the data and the final data analysis.
Statisticians often sample because obtaining the entire set of data
of interest is too expensive or time consuming.
Sampling is typically used in data mining because processing the
entire set of data of interest is too expensive or time consuming.
• The key principle for effective sampling is the following: using a sample will work almost as well as using the entire data set, if the sample is representative
• A sample is representative if it has approximately the same properties (of interest) as the original set of data

Sample size:
(Figure omitted: the same data set drawn with 8000, 5000, and 500 points.)

Types of sampling:
• Simple Random Sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• As each item is selected, it is removed from the
population
• Sampling with replacement
• Objects are not removed from the population as they
are selected for the sample.
• In sampling with replacement, the same object can be
picked up more than once
• Stratified sampling:
• Split the data into several partitions; then draw random samples from each partition (see the sketch below)
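A rough sketch of these sampling schemes, assuming pandas and scikit-learn are available (the data frame is synthetic):

# Sketch: simple random sampling (with/without replacement) and stratified sampling.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"value": range(100),
                   "group": ["A"] * 80 + ["B"] * 20})

without_repl = df.sample(n=10, replace=False, random_state=0)   # object removed once selected
with_repl    = df.sample(n=10, replace=True,  random_state=0)   # same object may appear twice

# Stratified sampling: the sample keeps the original A/B proportions (8 A, 2 B).
strat, _ = train_test_split(df, train_size=10, stratify=df["group"], random_state=0)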
Dimensionality reduction:
• Avoid curse of dimensionality
• Reduce amount of time and memory required by data
mining algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise.
• Techniques
• Principal Components Analysis (PCA)
• Singular Value Decomposition
• Others: supervised and non-linear techniques
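A minimal PCA sketch with scikit-learn (synthetic data; two components are kept only so the result can be visualized):

# Sketch: dimensionality reduction with Principal Components Analysis.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # 200 objects, 50 attributes

pca = PCA(n_components=2)             # keep the 2 strongest components
X_reduced = pca.fit_transform(X)      # shape (200, 2), much easier to visualize

print(pca.explained_variance_ratio_)  # variance captured by each component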
Curse of dimensionality:
• When dimensionality increases, data becomes increasingly
sparse in the space that it occupies.
• Definitions of density and distance between points, which are
critical for clustering and outlier detection, become less
meaningful.
• Feature subset selection (next) is another way to reduce the dimensionality of data.
Feature Subset Selection:
• Redundant features
• Duplicate much or all of the information contained in one
or more other attributes
• Example: purchase price of a product and the amount of
sales tax paid
• Irrelevant features
• Contain no information that is useful for the data mining
task at hand
• Example: students' ID is often irrelevant to the task of
predicting students' GPA
• Many techniques developed, especially for classification.

Feature Creation:
• Create new attributes that can capture the important
information in a data set much more efficiently than the original
attributes
• Three general methodologies:

• Feature extraction
• Example: extracting edges from images
• Feature construction
• Example: dividing mass by volume to get density
• Mapping data to new space
• Example: Fourier and wavelet analysis
Discretization and binarization:
Discretization:
Discretization is the process of converting a continuous attribute
into an ordinal attribute.
• A potentially infinite number of values are mapped into a
small number of categories
• Discretization is used in both unsupervised and supervised
settings

• Unsupervised Discretization
(Figure: the data consists of four groups of points and two outliers; the data is one-dimensional, but a random y component is added to reduce overlap.)

Supervised Discretization:

Binarization:
It maps a continuous or categorical attribute into one or more
binary variables.
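A small sketch of both operations, assuming scikit-learn's KBinsDiscretizer and OneHotEncoder (the bin count and example values are arbitrary):

# Sketch: unsupervised discretization of a continuous attribute and
# binarization of a categorical attribute.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

ages = np.array([[3.0], [17.0], [25.0], [42.0], [68.0]])

disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
age_bins = disc.fit_transform(ages)            # continuous -> ordinal categories

colors = np.array([["red"], ["green"], ["red"]])
color_flags = OneHotEncoder().fit_transform(colors).toarray()   # categorical -> 0/1 attributes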
Attribute transform:
• An attribute transform is a function that maps the entire set of
values of a given attribute to a new set of replacement values
such that each old value can be identified with one of the new
values
• Simple functions: x^k, log(x), e^x, |x|
• Normalization
• Refers to various techniques to adjust to differences
among attributes in terms of frequency of
occurrence, mean, variance, range
• Take out unwanted, common signal, e.g.,
seasonality
• In statistics, standardization refers to subtracting off the
means and dividing by the standard deviation
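A small numpy sketch of the transformations mentioned above (the sample values are arbitrary):

# Sketch: simple function transforms, standardization, and min-max normalization.
import numpy as np

x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0])

log_x        = np.log(x)                            # simple function transform, e.g. log(x)
standardized = (x - x.mean()) / x.std()             # subtract the mean, divide by the std deviation
minmax       = (x - x.min()) / (x.max() - x.min())  # rescale to the range [0, 1]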

Unit-1: Part-3:
MEASURES OF SIMILARITY AND DISSIMILARITY:
The similarity between two objects is a numerical measure of
the degree to which the two objects are alike. Similarities are higher
for pairs of objects that are more alike. Similarities are usually non-
negative and are often between 0 (no similarity) and 1 (complete
similarity).
The dissimilarity between two objects is a numerical measure
of the degree to which the two objects are different. Dissimilarities
are lower for more similar pairs of objects. Dissimilarities sometimes
fall in the interval [0, 1], but it is also common for them to range from 0 to ∞.
Similarity and Dissimilarity between Simple Attributes:
The proximity of objects with a number of attributes is typically
defined by combining the proximities of individual attributes, and
thus, we first discuss proximity between objects having a single
attribute.

Dissimilarities between Data Objects:


Distances:
The Euclidean distance between two objects x and y with n attributes is
d(x, y) = ( Σ_k (x_k − y_k)^2 )^(1/2).
This equation is generalized by the Minkowski distance metric:
d(x, y) = ( Σ_k |x_k − y_k|^r )^(1/r),
where r is a parameter and the sum runs over the n attributes. The following are the three most common examples of Minkowski distances.
r = 1. City block (Manhattan) distance.
r = 2. Euclidean distance.
r = ∞. Supremum distance. This is the maximum difference between any attribute of the objects.
Distances, such as the Euclidean distance, have some well-known
properties. If d(x, y) is the distance between two points, x and y, then
the following properties hold.
1. Positivity
(a) d(x, y) >= 0 for all x and y,
(b) d(x, y) = 0 only if x = y.
2. Symmetry
d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality
d(x, z) <= d(x, y) + d(y, z) for all points x, y, and z.
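The following numpy sketch computes the three Minkowski distances for two small example points (the values are made up):

# Sketch: city block (r=1), Euclidean (r=2), and supremum (r -> infinity) distances.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

manhattan = np.sum(np.abs(x - y))           # r = 1  -> 5.0
euclidean = np.sqrt(np.sum((x - y) ** 2))   # r = 2  -> about 3.61
supremum  = np.max(np.abs(x - y))           # r = oo -> 3.0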

Similarities between Data Objects:


For similarities, the triangle inequality (or the analogous
property) typically does not hold, but symmetry and positivity
typically do.
To be explicit, if s(x, y) is the similarity between points x and y, then the typical properties of similarities are the following:
1. s(x, y) = 1 only if x = y. (0 <= s <= 1)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

Examples of Proximity Measures:
This section provides specific examples of some similarity and dissimilarity measures.
Similarity Measures for Binary Data:
Similarity measures between objects that contain only binary
attributes are called similarity coefficients, and typically have values
between 0 and 1. A value of 1 indicates that the two objects are
completely similar, while a value of 0 indicates that the objects are
not at all similar.
Let x and y be two objects that consist of n binary attributes.
The comparison of two such objects, i.e., two binary vectors, leads to
the following four quantities (frequencies):
f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1
Simple Matching Coefficient:
One commonly used similarity coefficient is the simple matching coefficient (SMC), which is defined as:
SMC = (number of matching attribute values) / (number of attributes) = (f11 + f00) / (f00 + f01 + f10 + f11)
Jaccard Coefficient:
Suppose that x and y are data objects that represent two rows (two transactions) of a transaction matrix. The Jaccard coefficient, which ignores 0-0 matches and is therefore suited to asymmetric binary attributes, is defined as:
J = (number of 1-1 matches) / (number of attributes not involved in 0-0 matches) = f11 / (f01 + f10 + f11)
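A direct numpy sketch of both coefficients for two made-up binary vectors:

# Sketch: simple matching coefficient (SMC) and Jaccard coefficient.
import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

f11 = np.sum((x == 1) & (y == 1))   # 1-1 matches
f00 = np.sum((x == 0) & (y == 0))   # 0-0 matches
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc     = (f11 + f00) / (f00 + f01 + f10 + f11)   # counts 0-0 matches -> 0.9
jaccard = f11 / (f01 + f10 + f11)                 # ignores 0-0 matches -> about 0.67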
Cosine Similarity:
Documents are often represented as vectors, where each
attribute represents the frequency with which a particular term
(word) occurs in the document. Even though documents have
thousands or tens of thousands of attributes (terms), each document
is sparse since it has relatively few non-zero attributes.
The cosine similarity, defined next, is one of the most common measures of document similarity. If x and y are two document vectors, then
cos(x, y) = (x · y) / (||x|| ||y||),
where · denotes the vector dot product and ||x|| is the length (Euclidean norm) of vector x.
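A numpy sketch for two small term-frequency vectors (the counts are made up):

# Sketch: cosine similarity between two document vectors.
import numpy as np

x = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
y = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cosine)   # about 0.31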
Issues in Proximity Calculation:
(1) how to handle the case in which attributes have different
scales and/or are correlated,
(2) how to calculate proximity between objects that are
composed of different types of attributes, e.g., quantitative and
qualitative,
(3) how to handle proximity calculation when attributes
have different weights; i.e., when not all attributes contribute
equally to the proximity of objects.
Standardization and Correlation for Distance Measures:
A related issue is how to compute distance when there is correlation between some of the attributes, perhaps in addition to differences in the ranges of values. A generalization of Euclidean distance, the Mahalanobis distance, is useful when attributes are correlated and have different ranges of values.
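A possible sketch using scipy (the data is synthetic; scipy's mahalanobis function takes the inverse covariance matrix as its third argument):

# Sketch: Mahalanobis distance between two objects, accounting for correlated
# attributes with different ranges.
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
data = rng.multivariate_normal(mean=[0, 0], cov=[[3, 1], [1, 2]], size=500)

cov_inv = np.linalg.inv(np.cov(data, rowvar=False))   # inverse covariance of the data
d = mahalanobis(data[0], data[1], cov_inv)
print(d)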

Combining Similarities for Heterogeneous Attributes:


The previous definitions of similarity were based on approaches
that assumed all the attributes were of the same type. A general
approach is needed when the attributes are of different types. One
straightforward approach is to compute the similarity between each
attribute separately and then combine these similarities using a
method that results in a similarity between 0 and 1.
Selecting the Right Proximity Measure:
The following are a few general observations that may be
helpful. First, the type of proximity measure should fit the type of
data. For many types of dense, continuous data, metric distance
measures such as Euclidean distance are often used.
Proximity between continuous attributes is most often
expressed in terms of differences, and distance measures provide a
well-defined way of combining these differences into an overall
proximity measure. Although attributes can have different scales and be of differing importance, these issues can often be handled by standardizing or weighting the attributes, as discussed earlier.
For sparse data, which often consists of asymmetric attributes,
we typically employ similarity measures that ignore 0-0 matches. In some cases, transformation or normalization of the data is important for obtaining a proper similarity measure, since such transformations are not always present in proximity measures.
