
Data Mining

Unit I

Data Mining
Data mining is the process of extracting interesting patterns or useful knowledge from huge
amounts of data. It is the set of activities used to find new, hidden, unexpected, or unusual
patterns in data.
Knowledge discovery process
Knowledge Discovery from Data (KDD) is a process consisting of an iterative sequence of the
following steps:

1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate
for mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to
extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present the mined knowledge to the user)
Steps 1 to 4 are different forms of data preprocessing, where the data are prepared for mining.
The data mining step may interact with the user or a knowledge base.
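To make the sequence concrete, here is a minimal, self-contained Python sketch of the steps on a toy table. The column names, thresholds, and "interestingness" test are illustrative assumptions, not part of the original notes:

```python
# A toy walk-through of KDD steps 1-7 (illustrative only).
import pandas as pd

# Steps 1-2: cleaned, integrated input (here a single toy source).
raw = pd.DataFrame({"item": ["pen", "pen", None, "book"],
                    "price": [10.0, 10.0, 12.0, -1.0]})
data = raw.dropna().query("price >= 0")  # data cleaning

# Steps 3-4: select relevant columns and transform by aggregation.
summary = data.groupby("item")["price"].agg(["count", "mean"])

# Steps 5-6: "mine" and keep only the interesting patterns
# (here, items bought more than once -- an assumed measure).
patterns = summary[summary["count"] > 1]

# Step 7: knowledge presentation.
print(patterns)
```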
Various kinds of Data / Types of data :
Data mining should be applicable to any kind of information repository.

Types of data are,

1.Flat Files

Flat files are actually the most common data source for data mining algorithms, especially
at the research level. Flat files are simple data files in text format, with a structure known to
the data mining algorithm to be applied.

2.Relational databases

A relational database consists of a set of tables containing either values of entity
attributes or values of attributes from entity relationships. Tables have columns and rows,
where columns represent attributes and rows represent tuples.

A tuple in a relational table corresponds to either an object or a relationship between
objects and is identified by a set of attribute values representing a unique key.
3.Data Warehouse

A data warehouse is a repository of data collected from multiple data sources, intended
to be used as a whole under the same unified schema. A data warehouse gives the option to
analyze data from different sources under the same roof. Data from the different stores are
loaded, cleaned, transformed, and integrated together.

4.Transaction Database

A transaction database is a set of records representing transactions, each with a time
stamp.

The transactional database may have additional tables associated with it, which contain
other information regarding the sale, such as the date of the transaction, the customer ID number,
and so on.

Transaction databases can answer queries such as “Show me all the items purchased by
Raman” or “How many transactions include item number I3?”. They are also used for Market
Basket Analysis (MBA), which finds products that are frequently sold in combination with
other products, for example bread with milk.

5.Object relational databases

The object-relational data model inherits the essential concepts of object-oriented
databases, where, in general terms, each entity is considered as an object. Following the
AllElectronics example, objects can be individual employees, customers, or items. Data and code
relating to an object are encapsulated into a single unit. Each object has associated with it the
following:

 A set of variables that describe the object.
 A set of messages that the object can use to communicate with other objects, or with the rest of
the database system.
 A set of methods, where each method holds the code to implement a message. Ex: get_photo
(employee) will retrieve and return a photo of the given employee object.

6.Text and multimedia databases

Text databases are databases that contain word descriptions of objects. These descriptions
are usually not simple keywords but long sentences or paragraphs, such as product
specifications, error or bug reports, and summary reports.

Multimedia databases store Text, Image, Audio, and Video data.

7.Spatial databases

Spatial databases are databases that, in addition to usual data, store geographical
information like maps, and global or regional positioning.

8. Temporal Databases, Sequence Databases, and Time-Series Databases

A temporal database typically stores relational data that include time-related attributes,
allowing present data to be compared with past data (e.g., sales).

A sequence database stores sequences of ordered events, with or without a concrete
notion of time. Examples include customer shopping sequences.
A time-series database stores sequences of values or events obtained over repeated
measurements of time (e.g., hourly, daily, weekly).

9.World wide web (www)

The WWW is the most heterogeneous and dynamic repository available. A very large
number of authors and publishers continuously contribute to its growth, and a massive
number of users access its resources daily.
10. Heterogeneous databases

A heterogeneous database consists of a set of interconnected, autonomous component
databases. Objects in one component database may differ greatly from objects in other
component databases.

Kinds of Patterns (or) Data Mining Functionalities (or) Technologies used for
Data Mining:
A data mining system provides many functions for mining the data in a database. Data
mining functionalities are used to specify the kinds of patterns to be found in data mining
tasks.
Data mining tasks can be classified into two categories

1.Descriptive
Descriptive mining tasks identify patterns in data and characterize the general
properties of the data in the database.
2.Predictive models
Predictive mining tasks perform inference on the current data in order to make
predictions; they predict unknown values based on known data.
Data mining functionalities and the kinds of patterns they discover are described below.

1.Concept/Class Description: Characterization and Discrimination


 Data can be associated with classes or concepts.
For example, in the AllElectronics store, classes of items for sale include computers and
printers, and concepts of customers include bigSpenders and budgetSpenders.
These descriptions can be derived via
(1) data characterization
(2) data discrimination
(3) both data characterization and discrimination.
1. Data characterization – summarization of the general characteristics or features of a
target class of data.
 For example, to study the characteristics of software products whose sales increased by
10% in the last year, the relevant data can be collected by executing an SQL query.
 The output of data characterization can be presented in various forms. Examples include
pie charts, bar charts, multidimensional data cubes, and multidimensional tables.
2. Data discrimination – comparing the target class with one or a set of classes.
 For example, compare the general features of software products whose sales
increased by 10% in the last year with those whose sales decreased by at least 30%
during the same period.
3. both data characterization and discrimination
2. Mining Frequent Patterns, Associations, and Correlations
Frequent patterns are patterns that occur frequently in data.
 Frequent itemset: a set of items that frequently appear together in a transactional
data set (Example : milk and bread)
 Frequent subsequence: a pattern in which customers tend to purchase first a PC, followed
by a digital camera, and then a memory card, is a frequent (sub)sequential pattern, e.g., a
customer purchases product A followed by product B.
 Frequent substructure: refers to different structural forms, such as graphs or trees, which
may be combined with itemsets or subsequences.

 Association analysis: finds frequent patterns, typically presented as association rules such as the example below.


Association rule:
E.g., buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]

X is a variable representing a customer. A confidence of 50% means that if a customer buys a
computer, there is a 50% chance that she will also buy software. A support of 1% means that 1%
of all the transactions under analysis showed that computer and software were purchased
together.

Association rules are discarded if they do not satisfy both a minimum support threshold
and a minimum confidence threshold.
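As a concrete illustration of the two measures, the following small Python sketch computes support and confidence for the rule above over a toy set of transactions (the transactions are invented for illustration):

```python
# Compute support and confidence for buys(computer) => buys(software).
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer", "paper"},
    {"computer", "software", "printer"},
]

both = sum(1 for t in transactions if {"computer", "software"} <= t)
computer_only = sum(1 for t in transactions if "computer" in t)

support = both / len(transactions)   # fraction of ALL transactions
confidence = both / computer_only    # P(software | computer)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

A rule is kept only when support and confidence both meet the chosen minimum thresholds.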

3.Classification and prediction


 Classification
 It is the process of finding a model that describes and distinguishes data classes or
concepts, for the purpose of using the model to predict the class of objects whose
class label is unknown.
 The derived model is based on the analysis of a set of training data (i.e., data objects
whose class label is known).
 The derived model can be represented as classification (IF-THEN) rules, decision
trees, or neural networks (a small sketch follows below).
 Prediction
 Prediction is used to predict missing or unavailable numerical data values.
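A minimal classification sketch using a decision tree (via scikit-learn, assuming it is available); the training data and attribute meanings are invented for illustration:

```python
# Learn a model from training data with known class labels, then
# predict the class of an object whose label is unknown.
from sklearn.tree import DecisionTreeClassifier

X_train = [[25, 30000], [45, 80000], [35, 60000], [22, 20000]]  # [age, income]
y_train = ["risky", "safe", "safe", "risky"]                    # known labels

model = DecisionTreeClassifier().fit(X_train, y_train)
print(model.predict([[40, 70000]]))  # class of an unseen object
```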
4.Cluster Analysis

 Class label is unknown: group data to form new classes.
 Clusters of objects are formed based on the principle of maximizing the intraclass similarity and
minimizing the interclass similarity (see the sketch below).
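A small clustering sketch (scikit-learn k-means, assuming it is available); the points are invented, and no class labels are given to the algorithm:

```python
# Group unlabeled points into 2 clusters: points within a cluster are
# similar to each other and dissimilar to points in the other cluster.
from sklearn.cluster import KMeans

points = [[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9]]
labels = KMeans(n_clusters=2, n_init=10).fit_predict(points)
print(labels)  # e.g. [0 0 1 1]: two new classes formed from the data
```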

5.Outlier Analysis

 A database may contain data objects that do not comply with the general behavior
or model of the data.
 Outliers are usually discarded as noise or exceptions.
 Useful for fraud detection.
 E.g., detect purchases of extremely large amounts (a simple sketch follows).
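A simple outlier-detection sketch; the purchase amounts and the 2-standard-deviation rule are illustrative assumptions (many other criteria exist):

```python
# Flag purchases that deviate far from the mean purchase amount.
import statistics

amounts = [120, 95, 110, 130, 105, 9800]  # toy purchase amounts
mu = statistics.mean(amounts)
sigma = statistics.stdev(amounts)

outliers = [a for a in amounts if abs(a - mu) > 2 * sigma]
print(outliers)  # [9800] -- a candidate for fraud inspection
```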
6. Evolution Analysis

 Data evolution analysis describes and models regularities or trends for objects
whose behavior changes over time.
 E.g. identify stock evolution regularities for overall stock and for the stocks of
particular companies.

Major issues in Data Mining /Data Mining Issues


Data mining systems depend on databases to supply the raw input, and this raises
problems: such databases tend to be dynamic, incomplete, noisy, and large. The major
issues in data mining systems are as follows:

1. Mining methodology and user interaction


2. Performance Issues
3. Issues relating to the diversity of data types.

1.Mining methodology and user interaction


Mining different kinds of knowledge in databases – Different users may be interested in
different kinds of knowledge, so data mining should cover a wide spectrum of data analysis and
knowledge discovery tasks, which may use the same database in different ways.

Interactive mining of knowledge at multiple levels of abstraction – allows users to focus the
search for patterns, providing and refining data mining requests based on returned results,
and to view data and discovered patterns at multiple granularities.

Incorporation of background knowledge – Background knowledge can guide the discovery
process and allow discovered patterns to be expressed in concise terms at different levels of
abstraction.

Data mining query languages and ad-hoc data mining – Data mining query languages need to
be developed to allow users to describe ad hoc data mining tasks by facilitating the specification
of the relevant data.

Expression and visualization of data mining results – This requires the system to adopt
expressive knowledge representation techniques, such as graphs, trees, tables, and charts.
Handling noisy and incomplete data – Noisy or incomplete data may confuse the mining
process, causing the constructed knowledge model to overfit the data.

Pattern evaluation: the interestingness problem – A data mining system can uncover thousands
of patterns, many of which may be uninteresting to the given user.
2.Performance Issues

Efficiency and scalability of data mining algorithms – Algorithms must be able to extract
information efficiently from the huge amounts of data stored in databases.

Parallel, distributed, and incremental mining methods – The large size of databases, the wide
distribution of data, and the high cost and computational complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms.

3.Issues relating to the diversity of data types


Handling relational and complex types of data – Specific data mining systems should be
constructed for mining specific kinds of data. Therefore, one may expect to have different data
mining systems for different kinds of data.

Mining information from heterogeneous databases and global information systems (WWW)
– Data mining may help discover data regularities in multiple heterogeneous (different)
databases that are unlikely to be discovered by simple query systems, and may improve
information exchange and interoperability in heterogeneous databases.
Issues related to applications and social impacts
 Application of discovered knowledge
 Intelligent query answering
 Process control and decision making

Data objects and Attribute types


Data objects are the essential part of a database. A data object represents an entity and is
described by a group of attributes. For example, in a sales database, data objects may represent
customers, sales, or purchases. When data objects are stored in a database, they are called data
tuples.
Attribute
An attribute is a data field that represents a characteristic or feature of a data object. For a
customer object, attributes can be customer ID, address, etc. A set of attributes used to
describe a given object is known as an attribute vector.
Type of attributes :
Identifying attribute types is the first step of data preprocessing: we distinguish between the
different types of attributes and then preprocess the data accordingly.

1. Qualitative (Nominal (N), Ordinal (O), Binary (B))
2. Quantitative (Numeric, Discrete, Continuous)
Qualitative Attributes:
1. Nominal Attributes – related to names: The values of a nominal attribute are names of
things or some kind of symbols. The values represent some category or state, which is why
nominal attributes are also referred to as categorical attributes; there is no order (rank,
position) among the values of a nominal attribute.
Example: hair colour (black, brown, grey) or occupation (teacher, farmer, clerk).

2. Binary Attributes: A binary attribute has only 2 values/states, for example yes or no, affected
or unaffected, true or false.
Symmetric: Both values are equally important (e.g., gender).
Asymmetric: The two values are not equally important (e.g., a test result, where the positive
outcome carries more weight).
3. Ordinal Attributes: Ordinal attributes contain values that have a meaningful sequence or
ranking (order) between them, but the magnitude between successive values is not known. The
order of values shows what is important but does not indicate how important it is.
Example: size (small, medium, large).

Quantitative Attributes:
1. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented
in integer or real values. Numeric attributes are of 2 types: interval and ratio.
An interval attribute has values whose differences are interpretable, but it has no true zero
point. Consider temperature in degrees Centigrade: if the temperature of one day is twice the
value of another day, we cannot say that one day is twice as hot as the other.
A ratio attribute has an inherent zero point, so we can speak of a value as being a multiple of
another value. The values are ordered, and we can also compute the difference between values.
2. Discrete: A discrete attribute has a finite or countably infinite set of values, which may be
numerical or categorical.
Example: number of rooms in a house, zip codes.

3. Continuous: A continuous attribute has an infinite number of possible values and is typically
represented as real (floating-point) numbers; there can be infinitely many values between 2 and 3.
Example: height, weight, temperature.

Data Visualization
Data visualization is the graphical representation of data and information, making it easy
and quick for users to understand. Data visualization tools present data using visual
elements such as charts, graphs, and maps.

Characteristics of an Effective Graphical Visualization:

 It shows or visualizes data very clearly in an understandable manner.
 It encourages viewers to compare different pieces of data.
 It focuses our mind and keeps our eyes on the message, as the human brain tends to focus on
visual data more than written data.
 It also helps in identifying areas that need more attention and improvement.

Categories of Data Visualization



Numerical Data :
Numerical data is also known as quantitative data. It is any data that represents an amount,
such as the height, weight, or age of a person. Numerical data falls into two categories:

Continuous Data
Continuous data can take any value within a range (Example: height measurements).

Discrete Data
This type of data is not “continuous” (Example: Number of cars).

Numerical data is typically visualized with charts and numerical values, for example pie charts
and bar charts.

Categorical Data :
Categorical data is also known as qualitative data. It is any data that represents groups: it
consists of categorical variables used to represent characteristics such as a person's ranking,
a person's gender, etc.

Binary Data
In this, classification is based on two opposing states (Example: agrees or disagrees).

Nominal Data
In this, classification is based on attributes (Example: Male or Female).

Ordinal Data
In this, classification is based on the ordering of information (Example: a timeline).

Categorical data is typically visualized with graphics, diagrams, and flowcharts, for example
Venn diagrams.

Measuring Data Similarity and Dissimilarity


Similarity

 A numerical measure of how alike two data objects are.
 The similarity value is higher when objects are more alike.
 Example: two pens with the same colour, size, and model are similar.

Dissimilarity (e.g., distance)

 A numerical measure of how different two data objects are.
 The value is lower when objects are more alike.
 Minimum dissimilarity is often 0.
 Proximity refers to either a similarity or a dissimilarity.
 Example: two pens of the same size and model that differ only in colour
(a distance sketch follows below).
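A short sketch of one common dissimilarity measure, Euclidean distance, applied to the pen example; the attribute vectors are invented for illustration:

```python
# Euclidean distance between objects described by numeric attributes.
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

pen_a = [1.0, 14.0]  # toy [weight, length] attribute vector
pen_b = [1.0, 14.0]  # identical pen
pen_c = [5.0, 20.0]  # different pen

print(euclidean(pen_a, pen_b))  # 0.0 -> minimum dissimilarity (most alike)
print(euclidean(pen_a, pen_c))  # larger value -> less alike
```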

Data Preprocessing
Data can be preprocessed to improve the quality of the data and, consequently, of the mining
results, and also to improve the efficiency and ease of the mining process.

Why preprocess the data / Need for preprocessing


We need clean data to produce good results
 Data in the real world is dirty
Incomplete : lacking attribute values.
Noisy: Containing errors
Inconsistent: containing discrepancies in codes or names
 No quality data, no quality mining results
Quality decisions must be based on quality data
Data warehouse needs consistent integration of quality data.
Data preprocessing Techniques / Major tasks in data preprocessing

The major tasks in data preprocessing are as follows:


Data cleaning
Data cleaning “cleans” the data by filling in missing values, smoothing noisy data, identifying
or removing outliers, and resolving inconsistencies.

Data integration
Data integration combines data from multiple databases, data cubes, or files into a coherent
store.

Data transformation
Data transformation operations, such as normalization and aggregation, transform the data into
forms that contribute toward the success of the mining process.

Data reduction
Data reduction obtains a reduced representation of the data set that is much smaller in volume
but produces the same or similar analytical results.

Data Discretization
Data discretization is part of data reduction, with particular importance for numerical data.

Data preprocessing is an important step in the knowledge discovery process, because quality
decisions must be based on quality data.
Data Cleaning
Data cleaning tasks are used to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.
1.Missing Values
Data is not always available. For example, many tuples have no recorded value for
several attributes, such as customer income in sales data.
Missing data may be due to
(i) Equipment malfunction
(ii) Inconsistent with other recorded data and thus deleted
(iii) Data not entered due to misunderstanding.
Methods of handling missing data
 Ignore the tuple : This is usually done when the class label is missing. This method is
not very effective, unless the tuple contains several attributes with missing values.
 Fill in the missing value manually: Manually search for all missing values and replace
them with appropriate values. In general, this approach is time-consuming and may not
be feasible given a large data set with many missing values.
 Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like “Unknown”, which effectively forms a new class.
 Use the attribute mean to fill in the missing value: For example, suppose that the
average income of AllElectronics customers is $56,000. Use this value to replace the
missing value for income.
 Use the attribute mean for all samples belonging to the same class as the given tuple:
for example, replace the missing value with the average income of customers in the same
class as the given tuple.
 Use the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree induction
(see the sketch below).
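A short pandas sketch of two of the strategies above (global constant and attribute mean); the column names are invented, and the observed incomes are chosen so their mean matches the $56,000 running example:

```python
import pandas as pd

df = pd.DataFrame({"income": [40000.0, None, 72000.0, None],
                   "category": ["home", None, "office", "home"]})

# Global constant for a nominal attribute: a new "Unknown" class.
df["category"] = df["category"].fillna("Unknown")

# Attribute mean for a numeric attribute: (40000 + 72000) / 2 = 56000.
df["income"] = df["income"].fillna(df["income"].mean())
print(df)
```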
2. Noisy Data
Noise is a random error or variance in a measured variable. Noisy data may come from the
data collection, data entry, or data transmission process.
Data smoothing techniques are listed below,
1. Binning
2. Regression
3. Clustering
1. Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,”
that is, the values around it. The sorted values are distributed into a number of “buckets,”
or bins. Common binning techniques are illustrated below.
 Partition into (equal-frequency) bins:
The data for price are first sorted and then partitioned into equal-frequency bins of size 3
(i.e., each bin contains three values).
 Smoothing by bin means:
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
For example, the mean of the values 4, 8, and 15 in Bin 1 is 9.

 Smoothing by bin boundaries:
In smoothing by bin boundaries, the minimum and maximum values in a given bin are
identified as the bin boundaries. Each bin value is then replaced by the closest boundary
value (see the sketch below).
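The following Python sketch reproduces the classic textbook-style price example (sorted prices 4, 8, 15, 21, 21, 24, 25, 28, 34, equal-frequency bins of size 3), matching the bin-1 mean of 9 mentioned above:

```python
# Equal-frequency binning with smoothing by bin means and boundaries.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]
print(bins)  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

# Smoothing by bin means: every value becomes its bin's mean.
means = [[sum(b) / len(b)] * len(b) for b in bins]
print(means)  # [[9.0, 9.0, 9.0], [22.0, ...], [29.0, ...]]

# Smoothing by bin boundaries: every value snaps to the nearer of
# the bin's minimum or maximum.
bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]
print(bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```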
2.Regression: Data can be smoothed by fitting the data to a function, such as with
regression. Linear regression involves finding the “best” line to fit two attributes (or variables),
so that one attribute can be used to predict the other.
3. Clustering: Outliers may be detected by clustering, where similar values are organized into
groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may be considered
outliers.

Data Integration
The data integration technique combines data from multiple data stores; in other words, data
integration is the integration of multiple databases, data cubes, or flat files.
Issues to be considered in Data Integration
 Schema integration
 Redundancy
 Detecting and resolving data value conflicts
Schema integration
Schema integration integrates metadata from different sources. How can we match schema
and objects from different sources? This is the essence of the entity identification problem.
Entity identification problem: identifying real-world entities from multiple data sources, e.g.,
A.Cust-id = B.Cust#.
Redundancy
Redundant data often occur when multiple databases are integrated. An attribute is redundant
if it can be derived from another attribute or a group of other attributes. Such redundancy can
often be detected by correlation analysis.
Detecting and resolving data value conflicts
A third important issue in data integration is the detection and resolution of data value
conflicts. For example, for the same real-world entity, attribute values from different sources
may differ.
Handling Redundant Data in Data Integration
 Redundant data occur often when multiple databases are integrated. The same attribute
may have different names in different databases.
 Redundant data may be detected by correlation analysis (see the sketch below).
 Careful integration of the data from multiple sources may help reduce/avoid redundancies
and inconsistencies and improve mining speed and quality.
Data Reduction
Definition : Data reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume, yet closely maintains the integrity of the original data.
Strategies for data reduction include the following:
1. Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube. Data cubes store multidimensional aggregated information.
2. Attribute subset selection, Data sets for analysis may contain hundreds of attributes,
many of which may be irrelevant to the mining task or redundant. Attribute subset selection
reduces the data set size by removing irrelevant or redundant attributes.

3.Dimensionality reduction, where encoding mechanisms are used to reduce the data set
size. In dimensionality reduction, data encoding or transformations are applied so as to
obtain a reduced or “compressed” representation of the original data.
Data reduction types :
Lossless : If the original data can be reconstructed from the compressed data without any
information loss, the data reduction is called lossless.
Lossy: If the original data can be reconstructed from the compressed data only with some
loss of information, the data reduction is called lossy.
Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when
applied to a data vector X, transforms it into a numerically different vector, X′, of wavelet
coefficients. The two vectors are of the same length. When applying this technique to data
reduction, a compressed approximation of the data can be retained by storing only a small
fraction of the strongest wavelet coefficients.
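A brief DWT sketch using the third-party PyWavelets package (assuming it is installed); the data vector is invented for illustration:

```python
# One-level Haar DWT: X is transformed into approximation and detail
# coefficients of the same total length; keeping only the strongest
# coefficients gives a compressed approximation of X.
import pywt

x = [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
approx, detail = pywt.dwt(x, "haar")
print(approx, detail)
```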
4.Numerosity Reduction
Numerosity reduction techniques replace the original data volume with alternative, smaller
forms of data representation. These methods use parametric or non-parametric models to
obtain smaller representations of the original data.

Data Transformation

Data transformation in terms of data mining is the process of changing the form or
structure of existing attributes. Data transformation can involve the following.
Smoothing, which works to remove noise from the data. Such techniques include binning,
regression, and clustering.

Aggregation, where summary or aggregation operations are applied to the data. For example, the
daily sales data may be aggregated so as to compute monthly and annual total amounts.

Generalization of the data, where low-level data are replaced by higher-level concepts through
the use of concept hierarchies. For example, values for numerical attributes, like age, may be
mapped to higher-level concepts, like youth, middle-aged, and senior.

Normalization, where the attribute data are scaled so as to fall within a smaller range, such as
-1.0 to 1.0, or 0.0 to 1.0.
Attribute construction where new attributes are constructed and added from the given set of
attributes to help the mining process.

Methods of Data Normalization


1.Min-Max normalization
2.z-score normalization
3.Normalization by decimal scaling
1. Min-Max normalization: performs a linear transformation on the original data. Suppose that
min_A and max_A are the minimum and maximum values of an attribute, A. Min-max
normalization maps a value, v, of A to v′ in the range [new_min_A, new_max_A] by computing

v′ = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

Min-max normalization preserves the relationships among the original data values.
2.z-score normalization
In z-score normalization (or zero-mean normalization), the values of an attribute, A, are
normalized based on the mean and standard deviation of A. A value, v, of A is normalized to v′
by computing

v′ = (v − Ā) / σ_A

where Ā and σ_A are the mean and standard deviation, respectively, of attribute A. This method
of normalization is useful when the actual minimum and maximum of attribute A are unknown.
3.Normalization by decimal scaling normalizes by moving the decimal point of the values of
attribute A. The number of decimal points moved depends on the maximum absolute value of A.
A value, v, of A is normalized to v′ by computing

v′ = v / 10^j

where j is the smallest integer such that max(|v′|) < 1.
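The three methods side by side in a short Python sketch; the income values and the [0.0, 1.0] target range are illustrative assumptions:

```python
# Min-max, z-score, and decimal-scaling normalization of one attribute.
import statistics

values = [200.0, 300.0, 400.0, 600.0, 1000.0]
lo, hi = min(values), max(values)
mean, sd = statistics.mean(values), statistics.pstdev(values)

new_min, new_max = 0.0, 1.0
min_max = [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]
z_score = [(v - mean) / sd for v in values]

# Decimal scaling: smallest j with max|v'| < 1 (works for values >= 1).
j = len(str(int(max(abs(v) for v in values))))
decimal = [v / 10 ** j for v in values]

print(min_max)
print(z_score)
print(decimal)
```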

Data Discretization
The raw data are replaced by a smaller number of interval or concept labels. This
simplifies the original data and makes the mining more efficient.
Discretization : Reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals.
Discretization techniques
Discretization techniques can be categorized based on how the discretization is performed.
Discretization techniques for numeric data include:
Binning (a sketch follows below)
Clustering analysis …
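A small discretization sketch (pandas), mapping a continuous age attribute onto interval/concept labels; the cut points and labels are illustrative:

```python
# Reduce a continuous attribute to three concept labels by binning.
import pandas as pd

ages = pd.Series([13, 22, 35, 48, 61, 70])
labels = pd.cut(ages, bins=[0, 25, 55, 100],
                labels=["youth", "middle_aged", "senior"])
print(labels.tolist())  # ['youth', 'youth', 'middle_aged', ...]
```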
