Bca DM Unit I
Unit I
Data Mining
Data mining is the process of extracting interesting patterns or required data from huge
amounts of data. It is the set of activities used to find new, hidden, unexpected, or unusual
patterns in data.
Knowledge discovery process
Knowledge Discovery from Data (KDD) as a process is depicted in Figure 1.4 and consists of an
iterative sequence of the following steps: data cleaning, data integration, data selection, data
transformation, data mining, pattern evaluation, and knowledge presentation.
Data Sources for Data Mining
1.Flat Files
Flat files are actually the most common data source for data mining algorithms, especially
at the research level. Flat files are simple data files in text format, with a structure known to the
data mining algorithm to be applied.
2. Relational databases
A relational database is a collection of tables, each of which is assigned a unique name. Each
table consists of a set of attributes (columns or fields) and usually stores a large set of tuples
(records or rows).
4. Transaction Database
A transaction database consists of a file where each record represents a transaction. A
transaction typically includes a unique transaction identity number (trans ID) and a list of the
items making up the transaction. The transaction database may have additional tables associated
with it, which contain other information regarding the sale, such as the date of the transaction,
the customer ID number, and so on.
A transaction database can answer queries such as "Show me all the items purchased by
Raman" or "How many transactions include item number I3?". It is also used for Market Basket
Analysis (MBA), which finds products that are frequently sold in combination with other
products, for example bread with milk.
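A minimal illustration of the market basket idea, assuming a small invented list of transactions and a support threshold of 2 (both are hypothetical, not from the text):

from itertools import combinations
from collections import Counter

# Hypothetical transaction data: each inner list is one market basket.
transactions = [
    ["bread", "milk"],
    ["bread", "milk", "butter"],
    ["bread", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk", "eggs"],
]

# Count how often each pair of items is purchased together.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1

# Report pairs purchased together in at least 2 transactions.
for pair, count in pair_counts.most_common():
    if count >= 2:
        print(pair, count)  # e.g. ('bread', 'milk') 3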
Text databases
Text databases are databases that contain word descriptions for objects. These word
descriptions are usually not simple keywords but rather long sentences or paragraphs, such as
product specifications, error or bug reports, and summary reports.
7.Spatial databases
Spatial databases are databases that, in addition to usual data, store geographical
information like maps, and global or regional positioning.
The World Wide Web (WWW)
The WWW is the most heterogeneous and dynamic repository available. A very large
number of authors and publishers are continuously contributing to its growth, and a massive
number of users are accessing its resources daily.
10. Heterogeneous databases
A heterogeneous database consists of a set of interconnected, autonomous component
databases that may differ in schema and semantics.
Kinds of Patterns (or) Data Mining Functionalities (or) Technologies used for
Data Mining:
A data mining system is a tool that provides a lot of functionality to mine the data in a
database. Data mining functionalities are used to specify the kind of patterns to be found in data
mining tasks.
Data mining tasks can be classified into two categories:
1. Descriptive
Descriptive tasks identify patterns in data; they characterize the general properties of the
data in the database.
2. Predictive models
Predictive tasks predict unknown values based on known data; they perform inference on
the current data in order to make predictions.
Data mining functionalities, and the kinds of patterns they can discover, are described below.
5.Outlier Analysis
A database may contain data objects that do not comply with the general behavior
or model of the data.
Outliers are usually discarded as noise or exceptions.
Useful for fraud detection.
E.g. Detect purchases of extremely large amounts.
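As a hedged sketch of this idea (the 1.5 × IQR rule is a common choice, not one prescribed by the text; the purchase amounts are invented):

import statistics

# Hypothetical purchase amounts; one is suspiciously large.
purchases = [120, 135, 110, 150, 125, 140, 9800]

q1, _, q3 = statistics.quantiles(purchases, n=4)  # quartiles
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # values above this are treated as outliers

outliers = [p for p in purchases if p > upper_fence]
print(outliers)  # [9800] -> candidate fraudulent purchase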
6. Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects
whose behavior changes over time.
E.g. identify stock evolution regularities for overall stock and for the stocks of
particular companies.
Major Issues in Data Mining
1. Mining Methodology and User Interaction Issues
Interactive mining of knowledge at multiple levels of abstraction – allows users to focus the
search for patterns, providing and refining data mining requests based on returned results, and
to view data and discovered patterns at multiple granularities.
Data mining query languages and ad hoc data mining – Data mining query languages need to
be developed to allow users to describe ad hoc data mining tasks by facilitating the specification
of the relevant data.
Expression and visualization of data mining results – This requires the system to adopt
expressive knowledge representation techniques, such as graphs, trees, tables, and charts.
Handling noisy and incomplete data – Noisy or incomplete data may confuse the process,
causing the knowledge model constructed to overfit the data.
Pattern evaluation: the interestingness problem – A data mining system can uncover thousands
of patterns. Many of the patterns discovered may be uninteresting to the given user.
2. Performance Issues
Efficiency and scalability of data mining algorithms – Algorithms must efficiently extract
information from the large amounts of data in databases.
Parallel, distributed and incremental mining methods – The large size of databases, the wide
distribution of data, and the high cost and computational complexity of data mining methods lead
to the development of parallel and distributed data mining algorithms.
Mining information from heterogeneous databases and global information systems (WWW)
– Data mining may help disclose data regularities in multiple heterogeneous (different) databases
that are unlikely to be discovered by simple query systems, and may improve information
exchange and interoperability in heterogeneous databases.
3. Issues Related to Applications and Social Impacts
Application of discovered knowledge
Intelligent query answering
Process control and decision making
Types of Attributes
Qualitative Attributes:
1. Nominal Attributes: The values of a nominal attribute are names of things or symbols, with
no meaningful order among them. For example, hair color or marital status.
2. Binary Attributes: Binary data has only 2 values/states. For example, yes or no, affected or
unaffected, true or false.
Symmetric: Both values are equally important (e.g., gender: male or female).
Asymmetric: Both values are not equally important (e.g., the result of a medical test, where
the positive outcome is rarer and more significant).
3. Ordinal Attributes: Ordinal attributes contain values that have a meaningful sequence or
ranking (order) between them, but the magnitude between successive values is not known; the
order shows what is more important but does not indicate how much more important it is.
Quantitative Attributes:
1. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented
in integer or real values. Numeric attributes are of 2 types: interval and ratio.
An interval attribute has values whose differences are interpretable, but it has no true zero
point. Consider temperature in degrees Centigrade: if one day's temperature is twice that of
another in degrees, we cannot say that one day is twice as hot as the other.
A ratio attribute is a numeric attribute with an inherent zero point, so we can speak of a value
as being a multiple of another value. The values are ordered, and we can also compute the
difference between values.
2. Discrete: Discrete data have finite values; they can be numerical or categorical. These
attributes have a finite or countably infinite set of values.
Example: ZIP codes, or the number of cars a person owns.
3. Continuous: Continuous data have an infinite number of states. Continuous data is of float
type; there can be infinitely many values between 2 and 3.
Example: height, weight, or temperature.
Data Visualization
Data visualization is the representation of data and information in graphical form, making it
easy and quick for the user to understand. Data visualization tools present data using visual
elements such as charts, graphs, and maps.
Numerical Data :
Continuous Data
It can be narrowed or categorized (Example: Height measurements).
Discrete Data
This type of data is not “continuous” (Example: Number of cars).
The type of visualization techniques that are used to represent numerical data visualization is
Charts and Numerical Values. Examples are Pie Charts, Bar Charts, etc.
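For instance, a bar chart for a small numerical series can be drawn with matplotlib (an assumed library choice; the sales figures are invented):

import matplotlib.pyplot as plt

# Hypothetical monthly sales figures (numerical data).
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 90, 180]

plt.bar(months, sales)  # bar chart, one of the chart types named above
plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Monthly Sales")
plt.show()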
Categorical Data :
Categorical data is also known as qualitative data. Categorical data is any data whose values
represent groups. It consists of categorical variables that are used to represent characteristics
such as a person's ranking, a person's gender, etc.
Binary Data
In this, classification is based on positioning (Example: Agrees or Disagrees).
Nominal Data
In this, classification is based on attributes (Example: Male or Female).
Ordinal Data
In this, classification is based on ordering of information (Example: Timeline ).
The type of visualization techniques that are used to represent categorical data is Graphics,
Diagrams, and Flowcharts. Examples are Venn Diagram, etc.
Similarity
Similarity is a numerical measure of how alike two data objects are.
The similarity value is higher when the objects are more alike.
Example: Two pens with the same color, size, and model are similar.
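One common way to quantify this (my choice of measure, not one fixed by the text) is cosine similarity over numeric feature vectors; the pen encodings below are invented:

import math

def cosine_similarity(a, b):
    # 1.0 when the vectors point the same way; smaller when they diverge.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical objects described by (color code, size, model code).
pen1 = [1, 5, 3]
pen2 = [1, 5, 3]
pen3 = [4, 1, 2]

print(cosine_similarity(pen1, pen2))  # 1.0 -> identical, maximum similarity
print(cosine_similarity(pen1, pen3))  # about 0.55 -> less alike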
Data Preprocessing
Data can be preprocessed to improve the quality of the data and, consequently, of the mining
results, and also to improve the efficiency and ease of the mining process.
Data integration
Data integration combines data from multiple databases, data cubes, or files into a coherent
store.
Data transformation
Data transformation operations are normalization and aggregation that would contribute toward
the success of the mining process.
Data reduction
Obtains reduced representation of the data set that is much smaller in volume, but produces the
same or similar analytical results.
Data Discretization
Part of data reduction but with particular importance, especially for numerical data.
Data preprocessing is an important step in the knowledge discovery process, because quality
decisions must be based on quality data.
Data Cleaning
Data cleaning tasks are used to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.
1.Missing Values
Data is not always available. For example, many tuples may have no recorded value for
several attributes, such as customer income in sales data.
Missing data may be due to
(i) Equipment malfunction
(ii) Inconsistent with other recorded data and thus deleted
(iii) Data not entered due to misunderstanding.
Methods of handling missing data
Ignore the tuple : This is usually done when the class label is missing. This method is
not very effective, unless the tuple contains several attributes with missing values.
Fill in the missing value manually: Manually search for all missing values and replace
them with appropriate values. In general, this approach is time-consuming and may not
be feasible given a large data set with many missing values.
Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label. Example: like “Unknown” , a new class.
Use the attribute mean to fill in the missing value: For example, suppose that the
average income of AllElectronics customers is $56,000. Use this value to replace the
missing value for income.
Use the attribute mean for all samples belonging to the same class as the given tuple:
For example, replace the missing value with the average income of customers in the same
class (such as the same credit risk category).
Use the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree induction.
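A quick sketch of the attribute-mean strategy using pandas (an assumed tool; the column names are invented). The mean of the three known incomes below happens to be 56,000, echoing the AllElectronics figure above:

import pandas as pd

# Hypothetical customer data; customer B's income is missing (NaN).
df = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "income": [48000, None, 62000, 58000],
})

# Use the attribute mean to fill in the missing value.
df["income"] = df["income"].fillna(df["income"].mean())
print(df)  # B's income becomes 56000.0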
2. Noisy Data
Noise is a random error or variance in a measured variable. Noisy data may come from the
process of data collection, data entry, or data transmission.
Data smoothing techniques are listed below,
1. Binning
2. Regression
3. Clustering
1. Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,”
that is, the values around it. The sorted values are distributed into a number of “buckets,”
or bins. Figure 2.11 illustrates some binning techniques.
Partition into (equal-frequency) bins:
In this example, the data for price are first sorted and then partitioned into equal-
frequency bins of size 3 (i.e., each bin contains three values).
Smoothing by bin means:
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
For example, the mean of the values 4, 8, and 15 in Bin 1 is 9.
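A minimal sketch of smoothing by bin means. The first three prices (4, 8, 15) are the Bin 1 values above; the remaining six are assumed to complete the example data:

# Sorted price data; each consecutive group of 3 forms one equal-frequency bin.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3

smoothed = []
for i in range(0, len(prices), bin_size):
    bin_values = prices[i:i + bin_size]
    mean = sum(bin_values) / len(bin_values)
    # Smoothing by bin means: every value in the bin becomes the bin mean.
    smoothed.extend([mean] * len(bin_values))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]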
Data Integration
Data integration technique combines data from multiple data stores. In other words data
integration is the integration of multiple databases, data cubes, or flat files.
Issues to be considered in Data Integration
Schema integration
Redundancy
Detecting and resolving data value conflicts
Schema integration
Schema integration integrates metadata from different sources. How can we match
schema and objects from different sources? This is the essence of the entity identification
problem.
Entity identification problem: identify real-world entities from multiple data sources, e.g.,
A.Cust-id = B.Cust#.
Redundancy
Redundant data occur often when multiple databases are integrated. An attribute is redundant
if it can be derived from another attribute or a group of other attributes. Some redundancies
can be detected by correlation analysis.
Detecting and resolving data value conflicts
A third important issue in data integration is the detection and resolution of data value
conflicts. For example, for the same real-world entity, attribute values from different sources
may differ, due to differences in representation, scaling, or encoding.
Handling Redundant Data in Data Integration
Redundant data occurs often when multiple databases are integrated. The same attribute
may have different names in different databases.
Redundant data may be detected by correlation analysis.
Careful integration of the data from multiple sources may help reduce/avoid redundancies
and inconsistencies and improve mining speed and quality.
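A small sketch of correlation analysis for redundancy detection with NumPy; the two attributes are invented so that one is exactly derivable from the other:

import numpy as np

# Hypothetical attributes from two sources; annual income is derivable
# from monthly income, so the attributes are redundant.
monthly_income = np.array([3000, 4500, 5200, 2800, 6100])
annual_income = monthly_income * 12

# Pearson correlation coefficient: values near +1 or -1 suggest redundancy.
r = np.corrcoef(monthly_income, annual_income)[0, 1]
print(r)  # 1.0 -> perfectly correlated; one attribute can be dropped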
Data Reduction
Definition : Data reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume, yet closely maintains the integrity of the original data.
Strategies for data reduction include the following:
1. Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube. Data cubes store multidimensional aggregated information.
Aggregation operations are applied to the data in the construction of a data cube.
2. Attribute subset selection, Data sets for analysis may contain hundreds of attributes,
many of which may be irrelevant to the mining task or redundant. Attribute subset selection
reduces the data set size by removing irrelevant or redundant attributes.
3.Dimensionality reduction, where encoding mechanisms are used to reduce the data set
size. In dimensionality reduction, data encoding or transformations are applied so as to
obtain a reduced or “compressed” representation of the original data.
Data reduction types :
Lossless : If the original data can be reconstructed from the compressed data without any
information loss, the data reduction is called lossless.
Lossy : If only an approximation of the original data can be reconstructed from the
compressed data (i.e., some information is lost), the data reduction is called lossy.
Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when
applied to a data vector X, transforms it to a numerically different vector, X', of wavelet
coefficients. The two vectors are of the same length. When applying this technique to data
reduction, the wavelet-transformed data can be truncated: a compressed approximation of the
data is retained by storing only a small fraction of the strongest wavelet coefficients.
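A hedged sketch of one level of the Haar transform (the simplest DWT) in NumPy; the data vector and the number of retained coefficients are invented, and a real application would use a wavelet library with several levels:

import numpy as np

# Hypothetical data vector X (even length for one Haar step).
X = np.array([2.0, 2.0, 8.0, 8.0, 5.0, 7.0, 1.0, 1.0])

# One Haar step: pairwise averages (approximation) and differences (detail).
approx = (X[0::2] + X[1::2]) / np.sqrt(2)
detail = (X[0::2] - X[1::2]) / np.sqrt(2)
coeffs = np.concatenate([approx, detail])  # X', same length as X

# Data reduction: keep coefficients at or above the k-th largest magnitude.
k = 4
threshold = np.sort(np.abs(coeffs))[-k]
compressed = np.where(np.abs(coeffs) >= threshold, coeffs, 0.0)
print(compressed)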
4. Numerosity Reduction
Numerosity reduction techniques replace the original data volume by alternative, smaller
forms of data representation. These techniques use parametric or non-parametric models to
obtain smaller representations of the original data.
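As a minimal parametric example (linear regression is one such model; the x and y values are invented), only the two fitted parameters need to be stored instead of all the data points:

import numpy as np

# Hypothetical data: y grows roughly linearly with x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8, 12.2])

# Fit y ~ slope * x + intercept and store just these two parameters.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)

# Approximate values are reconstructed from the model when needed.
y_approx = slope * x + intercept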
Data Transformation
Data transformation in terms of data mining is the process of changing the form or
structure of existing attributes. Data transformation can involve the following.
Smoothing, which works to remove noise from the data. Such techniques include binning,
regression, and clustering.
Aggregation, where summary or aggregation operations are applied to the data. For example, the
daily sales data may be aggregated so as to compute monthly and annual total amounts.
Generalization of the data, where low-level data are replaced by higher-level concepts through
the use of concept hierarchies. For example, values for numerical attributes, like age, may be
mapped to higher-level concepts, like youth, middle-aged, and senior.
Normalization, where the attribute data are scaled so as to fall within a smaller range, such as
-1.0 to 1.0, or 0.0 to 1.0.
Attribute construction where new attributes are constructed and added from the given set of
attributes to help the mining process.
1. Min-max normalization
Min-max normalization performs a linear transformation on the original data. A value, v, of
attribute A is normalized to v' in the range [new_min_A, new_max_A] by computing
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
Min-max normalization preserves the relationships among the original data values.
2. z-score normalization
In z-score normalization (or zero-mean normalization), the values for an attribute, A, are
normalized based on the mean and standard deviation of A. A value, v, of A is normalized to v'
by computing
v' = (v - mean_A) / stddev_A
where mean_A and stddev_A are the mean and standard deviation, respectively, of attribute A.
This method of normalization is useful when the actual minimum and maximum of attribute A
are unknown.
3. Normalization by decimal scaling normalizes by moving the decimal point of values of
attribute A. The number of decimal points moved depends on the maximum absolute value of A.
A value, v, of A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
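A compact sketch of the three normalization methods in NumPy; the attribute values are invented:

import numpy as np

# Hypothetical attribute values to normalize.
v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# 1. Min-max normalization to the range [0.0, 1.0].
minmax = (v - v.min()) / (v.max() - v.min())

# 2. z-score normalization (zero mean, unit standard deviation).
zscore = (v - v.mean()) / v.std()

# 3. Decimal scaling: divide by 10^j so every |value| falls below 1.
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal = v / (10 ** j)

print(minmax, zscore, decimal, sep="\n")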
Data Discretization
The raw data are replaced by a smaller number of interval or concept labels. This
simplifies the original data and makes the mining more efficient.
Discretization : Reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals.
Discretization techniques
Discretization techniques can be categorized based on how the discretization is performed.
Discretization for numeric data
Binning
Clustering analysis…
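To close, a hedged sketch of discretizing numeric data by equal-width binning (one of the techniques listed above); the ages and number of intervals are invented:

import numpy as np

# Hypothetical continuous attribute: customer ages.
ages = np.array([13, 15, 22, 25, 33, 35, 42, 55, 61, 70])

# Divide the attribute's range into 3 equal-width intervals.
n_bins = 3
edges = np.linspace(ages.min(), ages.max(), n_bins + 1)

# Label each value with the interval it falls into (0, 1, or 2).
labels = np.digitize(ages, edges[1:-1])
for age, label in zip(ages, labels):
    print(age, "-> interval", label)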