Unit-1 Notes
Data mining refers to extracting or “mining” knowledge from large amounts of data.
KDD Process (Knowledge Discovery Process):
• The KDD process discovers knowledge from data through an iterative sequence of steps: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.
2) Data Warehouses:
• A data warehouse is usually modeled by a multidimensional data structure, called a data
cube, in which each dimension corresponds to an attribute or a set of attributes in the
schema, and each cell stores the value of some aggregate measure such as count or
sum(sales amount).
3) Transactional Data:
• In general, each record in a transactional database captures a transaction, such as a
customer’s purchase, a flight booking, or a user’s clicks on a web page.
• A transaction typically includes a unique transaction identity number (trans ID) and a
list of the items making up the transaction, such as the items purchased in the
transaction.
4) Other Kinds of Data:
• Time-related or sequence data: e.g., historical records, stock exchange data, and time-series and biological sequence data
• Data streams: e.g., video surveillance and sensor data, which are continuously transmitted
• Spatial data: e.g., maps
• Engineering design data: e.g., the design of buildings, system components, or integrated circuits
• Hypertext and multimedia data: including text, image, video, and audio data
• Graph and networked data: e.g., social and information networks
• The Web: a huge, widely distributed information repository made available by the Internet
3) Classification and Regression:
• Whereas classification predicts categorical (discrete, unordered) labels, regression
models continuous-valued functions.
4) Cluster Analysis:
• Clustering can be used to generate class labels for a group of data.
• The objects are clustered or grouped based on the principle of maximizing the
intraclass similarity and minimizing the interclass similarity.
• That is, clusters of objects are formed so that objects within a cluster have high
similarity in comparison to one another, but are rather dissimilar to objects in other
clusters. Each cluster so formed can be viewed as a class of objects, from which rules
can be derived.
5) Outlier Analysis
• A data set may contain objects that do not comply with the general behavior or model
of the data.
• These data objects are outliers. Many data mining methods discard outliers as noise or
exceptions.
• The analysis of outlier data is referred to as outlier analysis or anomaly mining.
Interestingness of Patterns:
• A data mining system has the potential to generate thousands or even millions of
patterns, or rules.
• A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test data with some degree of certainty, (3) potentially useful, and (4) novel.
• A pattern is also interesting if it validates a hypothesis that the user sought to confirm.
An interesting pattern represents knowledge.
• Ex: an association rule
buys(X, “computer”) ⇒ buys(X, “software”) [support = 40%, confidence = 50%]
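• Here, X is a variable representing a customer. A support of 40% means that 40% of all the transactions under analysis show that computer and software are purchased together; a confidence of 50% means that if a customer buys a computer, there is a 50% chance that the customer will buy the software as well.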
Major Issues in Data Mining:
1. Mining methodology and user interaction issues:
ii) Interactive mining of knowledge at multiple levels of abstraction:
• Because it is difficult to know exactly what can be discovered within a database, the data mining process should be interactive.
iii) Incorporation of background knowledge:
• Background knowledge, or information regarding the domain under study, may be used
to guide the discovery process and allow discovered patterns to be expressed in concise
terms and at different levels of abstraction.
iv) Data mining query languages and ad hoc data mining:
• Relational query languages (such as SQL) allow users to pose ad hoc queries for data retrieval. In a similar way, high-level data mining query languages need to be developed to allow users to describe ad hoc data mining tasks.
v) Presentation and visualization of data mining results:
• Discovered knowledge should be expressed in high-level languages, visual
representations, or other expressive forms so that the knowledge can be easily
understood and directly usable by humans.
vi) Handling noisy or incomplete data:
• The data stored in a database may reflect noise, exceptional cases, or incomplete data
objects.
vii) Pattern evaluation:
• The interestingness problem: A data mining system can uncover thousands of patterns.
• Many of the patterns discovered may be uninteresting to the given user.
2. Performance issues:
i) Efficiency and scalability of data mining algorithms:
• To effectively extract information from a huge amount of data in databases, data mining
algorithms must be efficient and scalable.
ii) Parallel, distributed, and incremental mining algorithms:
• The huge size of many databases, the wide distribution of data, and the computational
complexity of some data mining methods are factors motivating the development of
parallel and distributed data mining algorithms.
3. Issues relating to the diversity of database types:
i) Handling of relational and complex types of data:
• Because relational databases and data warehouses are widely used, the development of
efficient and effective data mining systems for such data is important.
ii) Mining information from heterogeneous databases and global information
systems:
• Local- and wide-area computer networks (such as the Internet) connect many sources
of data, forming huge, distributed, and heterogeneous databases.
Data Preprocessing:
• Data preprocessing is a data mining step that transforms raw data into a useful and efficient format.
Data Cleaning:
a) Missing values: the following strategies can be used to deal with missing values (a short code sketch follows this list).
• Ignore the tuple
• Fill in the missing value manually
• Use a global constant to fill in the missing value
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same class as the given tuple
• Use the most probable value to fill in the missing value
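A minimal sketch of two of these strategies using pandas; the table and the column names income and class are made up for illustration:

    import pandas as pd

    # Toy data with missing values (NaN) in the numeric attribute "income".
    df = pd.DataFrame({
        "class": ["A", "A", "B", "B"],
        "income": [50.0, None, 30.0, None],
    })

    # Fill with the overall attribute mean.
    df["income_mean"] = df["income"].fillna(df["income"].mean())

    # Fill with the mean of the samples belonging to the same class as the tuple.
    df["income_class_mean"] = df["income"].fillna(
        df.groupby("class")["income"].transform("mean")
    )
    print(df)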
b) Noisy data:
• Noise is a random error or variance in a measured variable. In order to remove the noise, data smoothing techniques are used.
• Data Smoothing Techniques:
i) Binning: Binning methods smooth a sorted data value by consulting its neighborhood, i.e., the values around it. The sorted values are distributed into a number of “buckets,” or bins, and each value is then replaced by, for example, the bin mean or the closest bin boundary (see the sketch below).
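A small sketch of equal-frequency binning with smoothing by bin means, using only the standard library; the data values and the bin size of 3 are illustrative:

    # Smoothing by bin means: split the sorted data into equal-frequency bins
    # and replace every value in a bin by that bin's mean.
    data = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
    bin_size = 3

    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_values = data[i:i + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))

    print(smoothed)   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]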
Data Integration:
• Data integration issues:
i) Entity Identification Problem:
How can equivalent real-world entities from multiple data sources be matched up? This
is referred to as the entity identification problem.
For example, how can the data analyst or the computer be sure that customer id in
one database and cust number in another refer to the same attribute?
Examples of metadata for each attribute include the name, meaning, data type, and range of values permitted for the attribute, and null rules for handling blank, zero, or null values.
Such metadata can be used to help avoid errors in schema integration.
ii) Redundancy and Correlation Analysis:
An attribute may be redundant if it can be “derived” from another attribute or
set of attributes.
Inconsistencies in attribute or dimension naming can also cause redundancies in the
resulting data set
Some redundancies can be detected by correlation analysis. Given two attributes, such
analysis can measure how strongly one attribute implies the other, based on the available
data.
For nominal data, we use the χ² (chi-square) test. For numeric attributes, we can use the correlation coefficient and covariance (a small code sketch follows).
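A brief sketch of both checks, assuming NumPy and SciPy are available; the contingency table counts and the numeric arrays are illustrative:

    import numpy as np
    from scipy.stats import chi2_contingency, pearsonr

    # Chi-square test for two nominal attributes (e.g., gender vs. preferred reading),
    # given their observed contingency table.
    observed = np.array([[250, 200],
                         [50, 1000]])
    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(chi2, p_value)   # a very small p-value suggests the attributes are correlated

    # Pearson correlation coefficient for two numeric attributes.
    a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
    b = np.array([1.0, 2.0, 2.5, 4.0, 5.5])
    r, _ = pearsonr(a, b)
    print(r)   # values near +1 or -1 indicate strong correlation (possible redundancy)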
iii)Tuple Duplication:
In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level (see the short example below).
The use of denormalized tables is another source of data redundancy.
Inconsistencies often arise between various duplicates, due to inaccurate data entry or
updating some but not all data occurrences.
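A one-line duplicate check with pandas, on a made-up customer table:

    import pandas as pd

    customers = pd.DataFrame({
        "cust_id": [1, 2, 2, 3],
        "name": ["Ann", "Bob", "Bob", "Carol"],
    })

    print(customers.duplicated())          # flags the repeated tuple
    deduped = customers.drop_duplicates()  # keeps the first occurrence of each tuple
    print(deduped)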
Data Transformation:
In this preprocessing step, the data are transformed or consolidated so that the resulting
mining process may be more efficient, and the patterns found may be easier to
understand.
• Data Transformation Strategies:
i) Smoothing: which works to remove noise from the data. Techniques include
binning, regression, and clustering.
ii) Attribute construction: where new attributes are constructed and added from the given set of attributes to help the mining process.
iii) Aggregation: where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual
total amounts.
iv) Normalization: where the attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.
There are many methods for data normalization. Some are:
a) Min-max normalization:
Suppose that minA and maxA are the minimum and maximum values of an attribute, A.
Min-max normalization maps a value, vi, of A to v′i in the range [new_minA, new_maxA] by computing
v′i = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
b) z-score normalization (or zero-mean normalization):
Here a value, vi, of A is normalized to v′i using the mean (Ā) and standard deviation (σA) of A:
v′i = (vi − Ā) / σA
A code sketch of both normalizations follows.
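A minimal NumPy sketch of both normalizations, with a made-up array of attribute values and [new_minA, new_maxA] = [0.0, 1.0]:

    import numpy as np

    values = np.array([73600.0, 12000.0, 98000.0, 54000.0])

    # Min-max normalization to the range [0.0, 1.0].
    new_min, new_max = 0.0, 1.0
    v_minmax = (values - values.min()) / (values.max() - values.min()) \
               * (new_max - new_min) + new_min

    # z-score normalization: subtract the mean, divide by the standard deviation.
    v_zscore = (values - values.mean()) / values.std()

    print(v_minmax)
    print(v_zscore)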
Data Reduction:
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet produces the same or similar analytical results as the original data.
Data reduction techniques:
1. Data cube aggregation: where aggregation operations are applied to the data in the construction of a data cube.
2. Attribute subset selection: where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed. Basic heuristic methods include the following (a code sketch follows the list):
1.Stepwise forward selection: The procedure starts with an empty set of attributes as
the reduced set. The best of the original attributes is determined and added to the
reduced set. At each subsequent iteration or step, the best of the remaining original
attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes.
At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise
forward selection and backward elimination methods can be combined so that, at each
step, the procedure selects the best attribute and removes the worst from among the
remaining attributes.
4. Decision tree induction: Decision tree algorithms, such as ID3, C4.5, and CART, were originally intended for classification. Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the “best” attribute to partition the data into individual classes.
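A short sketch of stepwise forward selection using scikit-learn's SequentialFeatureSelector (assuming scikit-learn is available; the synthetic data and the choice of three selected attributes are arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    # Synthetic data set: 10 attributes, only a few of them informative.
    X, y = make_classification(n_samples=200, n_features=10,
                               n_informative=3, random_state=0)

    # Greedy forward selection: start from an empty attribute set and repeatedly
    # add the attribute that most improves the classifier, up to 3 attributes.
    selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                         n_features_to_select=3,
                                         direction="forward")
    selector.fit(X, y)
    print(selector.get_support())   # boolean mask of the selected attributes

Stepwise backward elimination corresponds to direction="backward", which starts from the full attribute set and repeatedly drops the weakest attribute.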
Dimensionality Reduction
• In dimensionality reduction, data encoding or transformations are applied so as to
obtain a reduced or “compressed” representation of the original data.
• If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless; if only an approximation of the original data can be reconstructed, it is called lossy.
There are two popular and effective methods of lossy dimensionality reduction: wavelet transforms and principal components analysis.
i)Wavelet Transforms
• The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X′, of wavelet coefficients.
• The two vectors are of the same length. When applying this technique to data reduction,
we consider each tuple as an n-dimensional data vector, that is, X = (x1,x2,...,xn),
depicting n measurements made on the tuple from n database attributes.
• A compressed approximation of the data can be retained by storing only a small fraction
of the strongest of the wavelet coefficients.
• The technique also works to remove noise without smoothing out the main features of
the data, making it effective for data cleaning as well
• There are several families of DWTs. Popular wavelet transforms include the Haar-2,
Daubechies-4, and Daubechies-6 transforms.
• The method is as follows (a code sketch follows these steps):
1. The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary (L ≥ n).
2.Each transform involves applying two functions. The first applies some data
smoothing, such as a sum or weighted average. The second performs a weighted
difference, which acts to bring out the detailed features of the data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of measurements (x2i, x2i+1). This results in two sets of data of length L/2. In general,
these represent a smoothed or low-frequency version of the input data and the high
frequency content of it, respectively.
4. The two functions are recursively applied to the sets of data obtained in the previous
loop, until the resulting data sets obtained are of length 2.
5. Selected values from the data sets obtained in the above iterations are designated the wavelet coefficients of the transformed data.
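A tiny sketch of wavelet-based reduction, assuming the PyWavelets (pywt) package; the 8-point vector and the choice to keep only the 4 strongest coefficients are illustrative:

    import numpy as np
    import pywt

    # An 8-point data vector (the length is already a power of 2).
    x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

    # Full Haar decomposition into smoothed (low-frequency) and detail (high-frequency) coefficients.
    coeffs = pywt.wavedec(x, "haar")
    flat, slices = pywt.coeffs_to_array(coeffs)

    # Keep only the strongest coefficients and zero out the rest (lossy compression).
    keep = 4
    threshold = np.sort(np.abs(flat))[-keep]
    flat[np.abs(flat) < threshold] = 0.0

    # Reconstruct an approximation of the original data from the truncated coefficients.
    approx = pywt.waverec(pywt.array_to_coeffs(flat, slices, output_format="wavedec"), "haar")
    print(approx)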
iv) Sampling
• Sampling can be used as a data reduction technique because it allows a large data set to
be represented by a much smaller random sample of the data.
• Suppose that a large data set, D, contains N tuples. Let’s look at the most common ways that we could sample D for data reduction (a short code sketch follows the list):
• Simple random sample without replacement (SRSWOR) of size s: This is created
by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple
in D is 1/N, that is, all tuples are equally likely to be sampled.
• Simple random sample with replacement (SRSWR) of size s: This is similar to
SRSWOR, except that each time a tuple is drawn from D, it is recorded and then
replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn
again.
• Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters,” then
an SRS of s clusters can be obtained, where s < M. For example, tuples in a database
are usually retrieved a page at a time, so that each page can be considered a cluster. A
reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples.
• Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified
sample of D is generated by obtaining an SRS at each stratum.
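A short pandas sketch of these sampling schemes on a made-up table with a stratum column; the sample sizes and fractions are arbitrary:

    import pandas as pd

    df = pd.DataFrame({
        "stratum": ["young", "young", "middle", "middle", "senior", "senior"],
        "value": [1, 2, 3, 4, 5, 6],
    })

    srswor = df.sample(n=3, replace=False, random_state=0)   # SRSWOR of size s = 3
    srswr = df.sample(n=3, replace=True, random_state=0)     # SRSWR of size s = 3

    # Stratified sample: an SRS drawn within each stratum.
    stratified = df.groupby("stratum", group_keys=False).sample(frac=0.5, random_state=0)

    print(srswor, srswr, stratified, sep="\n\n")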
Data Discretization and Concept Hierarchy Generation
• Data discretization: Data discretization techniques can be used to reduce the number
of values for a given continuous attribute by dividing the range of the attribute into
intervals.
• Interval labels can then be used to replace actual data values.
• If the discretization process uses class information, then we say it is supervised
discretization.
• Otherwise, it is unsupervised.
• If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting; otherwise, it is bottom-up discretization or merging.
• Concept hierarchy: A concept hierarchy for a given numerical attribute defines a discretization of the attribute.
• Concept hierarchies can be used to reduce the data by collecting and replacing low-
level concepts (such as numerical values for the attribute age) with higher-level
concepts (such as youth, middle-aged, or senior).
Discretization and Concept Hierarchy Generation for Numerical Data
The following are methods for Data Discretization and Concept Hierarchy Generation:
• i) Binning: Binning is a top-down splitting technique based on a specified number of bins.
• ii) Histogram Analysis: Histogram analysis is an unsupervised discretization technique because it does not use class information.
• iii) Cluster Analysis:
• Cluster analysis is a popular data discretization method.
• iv) Discretization by Intuitive Partitioning:
• Numerical ranges are partitioned into relatively uniform, easy-to-read intervals that appear intuitive or “natural.”
• For example, annual salaries broken into ranges like ($50,000, $60,000] are often more desirable than ranges like ($51,263.98, $60,872.34].
• The 3-4-5 rule can be used to segment numerical data into relatively uniform, natural-seeming intervals.
• In general, the rule partitions a given range of data into 3, 4, or 5 relatively equal-width
intervals, recursively and level by level, based on the value range at the most significant
digit.
• The rule is as follows (a small code sketch follows this list):
• If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, then
partition the range into 3 intervals
• If it covers 2, 4, or 8 distinct values at the most significant digit, then partition the range
into 4 equal-width intervals.
• If it covers 1, 5, or 10 distinct values at the most significant digit, then partition the
range into 5 equal-width intervals.
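A minimal sketch of the core decision in the 3-4-5 rule: choosing how many equal-width intervals to use based on the number of distinct values the range covers at its most significant digit. The function names are hypothetical, and the recursive, level-by-level application of the rule (and the handling of outliers) is omitted:

    def partitions_345(n_msd_values: int) -> int:
        # Number of equal-width intervals suggested by the 3-4-5 rule.
        if n_msd_values in (3, 6, 7, 9):
            return 3
        if n_msd_values in (2, 4, 8):
            return 4
        if n_msd_values in (1, 5, 10):
            return 5
        raise ValueError("unexpected count of most-significant-digit values")

    def equal_width_intervals(low: float, high: float, n_msd_values: int):
        # Split [low, high] into the number of equal-width intervals the rule suggests.
        k = partitions_345(n_msd_values)
        width = (high - low) / k
        return [(low + i * width, low + (i + 1) * width) for i in range(k)]

    # Example: a range of -1,000,000 to 2,000,000 covers 3 distinct values at the
    # most significant digit (millions), so it is split into 3 equal-width intervals.
    print(equal_width_intervals(-1_000_000, 2_000_000, n_msd_values=3))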
• Concept Hierarchy Generation for Nominal Data (Categorical Data):
• Categorical data are discrete data.
• Categorical attributes have a finite (but possibly large) number of distinct values, with
no ordering among the values.
• Examples include geographic location, job category, and item type. There are several
methods for the generation of concept hierarchies for categorical data.
i) Specification of a partial ordering of attributes explicitly at the schema level by
users or experts:
• Concept hierarchies for nominal attributes or dimensions typically involve a group of
attributes.
• A user or expert can easily define a concept hierarchy by specifying a partial or total
ordering of the attributes at the schema level.
• For example, suppose that a relational database contains the following group of attributes: street, city, province or state, and country.
• A hierarchy can be defined by specifying the total ordering among these attributes at the schema level, such as street < city < province or state < country (a tiny code sketch follows).
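A tiny sketch of climbing such a schema-level hierarchy with lookup tables; the location values and dictionary names are made up for illustration:

    # Hypothetical schema-level hierarchy: street < city < province_or_state < country.
    city_of_street = {"101 Main St": "Vancouver", "8 Bay Rd": "Toronto"}
    province_of_city = {"Vancouver": "British Columbia", "Toronto": "Ontario"}
    country_of_province = {"British Columbia": "Canada", "Ontario": "Canada"}

    def roll_up(street: str) -> str:
        # Replace a low-level street value with its highest-level concept (country).
        city = city_of_street[street]
        province = province_of_city[city]
        return country_of_province[province]

    print(roll_up("101 Main St"))   # Canada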
ii) Specification of a portion of a hierarchy by explicit data grouping:
• This is essentially the manual definition of a portion of a concept hierarchy.
• In a large database, it is unrealistic to define an entire concept hierarchy by explicit
value enumeration.
• However, we can easily specify explicit groupings for a small portion of intermediate-level data.
• For example, after specifying that province and country form a hierarchy at the schema level, a user could define some intermediate levels manually, such as “{Alberta, Saskatchewan, Manitoba} ⊂ prairies Canada” and “{British Columbia, prairies Canada} ⊂ Western Canada.”
iii) Specification of a set of attributes, but not of their partial ordering:
• A user may specify a set of attributes forming a concept hierarchy, but omit to explicitly
state their partial ordering.
• The system can then try to automatically generate the attribute ordering so as to
construct a meaningful concept hierarchy.
iv)Specification of only a partial set of attributes:
• The user may have included only a small subset of the relevant attributes in the
hierarchy specification.
• For example, instead of including all of the hierarchically relevant attributes for location, the user may have specified only street and city.