Unit 1 DM
Unit-1
Syllabus
Data Mining:
Data–Types of Data–, Data Mining Functionalities–
Interestingness Patterns–Classification of Data Mining
systems– Data mining Task primitives –Integration of Data
mining system with a Data warehouse–Major issues in Data
Mining–Data Preprocessing.
What is Data Mining?
✔Extracting previously unknown, potentially useful information from large collections of data.
Characteristics of Data Mining:
Non-Trivial: the extraction should involve a non-trivial process; the patterns should not be obvious from the raw data.
Novel: the discovered patterns should be new, i.e., not already known to the user.
Useful: the information retrieved should be useful for decision making.
When mining relational databases, we can go further by searching for trends or data
patterns.
For example, data mining systems can analyze customer data to predict the credit risk of new customers based on their
income, age, and previous credit information.
Data mining systems may also detect deviations—that is, items with sales that are far from those expected in comparison
with the previous year.
2. Data Warehouse Data
✔A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a
single site.
✔Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data
refreshing.
✔A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which each dimension corresponds to an
attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure such as count or sum(sales
amount). A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.
✔Data are organized around major subjects
e.g. customer, item, supplier and activity.
✔Provide information from a historical perspective
e.g. from the past 5 – 10 years
✔Typically summarized to a higher level
e.g. a summary of the transactions per item type for each store
✔User can perform drill-down or roll-up operation to view the data at
different degrees of summarization.
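The roll-up operation described above can be sketched in plain Python. The fact rows, the city-to-country hierarchy, and the `roll_up` helper below are hypothetical illustrations, not part of any warehouse API:

```python
from collections import defaultdict

# Fact rows at the city level: (city, item_type, sales_amount).
facts = [
    ("Chennai", "phone", 100), ("Chennai", "laptop", 250),
    ("Mumbai",  "phone", 150), ("Mumbai",  "laptop", 300),
]

# Concept hierarchy on the location dimension: city -> country.
city_to_country = {"Chennai": "India", "Mumbai": "India"}

def roll_up(rows):
    """Aggregate sales from the city level up to the country level."""
    totals = defaultdict(int)
    for city, item, amount in rows:
        totals[(city_to_country[city], item)] += amount
    return dict(totals)

print(roll_up(facts))  # {('India', 'phone'): 250, ('India', 'laptop'): 550}
```

Drill-down is the inverse direction: returning from country totals to the finer-grained city rows.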
3. Transactional Data
✔In general, each record in a transactional database captures a transaction, such as a customer’s purchase, a flight
booking, or a user’s clicks on a web page.
✔A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the
transaction, such as the items purchased in the transaction.
✔A transactional database may have additional tables, which contain other information related to the transactions,
such as item description, information about the salesperson or the branch, and so on.
4. Other Kinds of Data
✔Time-related or sequence data (e.g., historical records, stock exchange data, and time-series and biological sequence data)
✔Data streams (e.g., video surveillance and sensor data, which are continuously transmitted)
✔Engineering design data (e.g., the design of buildings, system components, or integrated circuits)
✔Hypertext and multimedia data (including text, image, video, and audio data)
✔The Web (a huge, widely distributed information repository made available by the Internet)
A typical DM System Architecture
What makes a pattern interesting? A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, and novel.
Can a data mining system generate all of the interesting patterns? This refers to the completeness of a data mining algorithm; it is often unrealistic to generate every possible pattern, so user-provided constraints and interestingness measures should be used to focus the search.
Can a data mining system generate only interesting patterns? This is an optimization problem: it is highly desirable for data mining systems to generate only interesting patterns, so that users need not sift through all the mined patterns to find the useful ones.
✔Data mining systems can also be categorized according to the applications they are adapted to.
✔For example, data mining systems may be tailored specifically for finance, telecommunications, the stock market, and so on.
✔The Interestingness measures are used to separate interesting and uninteresting patterns from the knowledge.
✔They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures.
For example, interestingness measures for association rules include support and confidence.
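As a minimal sketch, support and confidence for a rule such as {bread} → {butter} can be computed directly; the transactions and helper names below are hypothetical illustrations:

```python
# Hypothetical market-basket transactions, each a set of items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """Confidence of the rule lhs -> rhs: support(lhs | rhs) / support(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"bread", "butter"}, transactions))       # 0.5
print(confidence({"bread"}, {"butter"}, transactions))  # ~0.667
```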
✔This refers to the form in which discovered patterns are to be displayed. Users can choose from different forms for
knowledge presentation.
No Coupling
❑No coupling means that a data mining system will not use any facility of a database or data warehouse system.
❑It may fetch data from a particular source (such as a file system), process data using some data mining algorithms, and then store the mining results in another file.
Drawbacks of No Coupling
❖First, without using a Database/Data Warehouse system, a Data Mining system may spend a substantial amount of time finding, collecting, cleaning, and transforming data.
❖Second, there are many tested, scalable algorithms and data structures implemented in Database and Data Warehouse systems; without coupling, a Data Mining system cannot take advantage of them.
Loose Coupling
❑In this Loose coupling, the data mining system uses some facilities / services of a database or data warehouse
system. The data is fetched from a data repository managed by these (DB/DW) systems.
❑Data mining approaches are used to process the data, and the processed data is then saved either in a file or in a designated place in a database or data warehouse.
❑Loose coupling is better than no coupling because it can fetch any portion of data stored in Databases or Data Warehouses.
❑It is difficult for loose coupling to achieve high scalability and good performance with large data sets.
Semi-Tight Coupling
✔Semi tight coupling means that besides linking a Data Mining system to a Data Base/Data Warehouse system, efficient
implementations of a few essential data mining primitives can be provided in the DB/DW system.
✔These primitives can include sorting, indexing, aggregation, histogram analysis, multi way join, and pre-computation of
some essential statistical measures, such as sum, count, max, min, standard deviation.
Advantage of Semi-Tight Coupling
This Coupling will enhance the performance of Data Mining systems
Tight Coupling
Tight coupling means that a Data Mining system is smoothly integrated into the Data Base/Data Warehouse system.
The data mining subsystem is treated as one functional component of the information system. Data mining queries and
functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing
methods of a DB or DW system.
Major Issues in Data Mining
Data Reduction
✔Data reduction is a process used in data processing and analysis to reduce the amount of data without significantly affecting its integrity
or quality. The goal is to simplify or compress the dataset to make it easier to store, process, and analyze while retaining the essential
information.
✔Mining on the reduced data set should be more efficient yet to produce same analytical results.
1. Data Compression
2. Dimensionality Reduction
3. Numerosity Reduction
Dimensionality Reduction
->The number of input features, variables, or columns present in a given dataset is known as
dimensionality, and the process to reduce these features is called dimensionality reduction.
->Dimensionality reduction can be defined as a way of converting a higher-dimensional dataset into a lower-dimensional dataset that provides similar information.
Ex: speech recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
Wavelet Transformation
● The signal is represented by wavelets, which are small, oscillating functions that capture both time and
frequency information.
● The discrete wavelet transform decomposes a signal into a set of basis functions; these basis functions are called wavelets.
● The data vector X is transformed into a numerically different vector, Xo, of wavelet coefficients when the
DWT is applied.
● The two vectors X and Xo must be of the same length. When applying this technique to data reduction, we consider an n-dimensional data tuple X = (x1, x2, …, xn), where n is the number of attributes present in the data.
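One level of the Haar DWT (the simplest wavelet) can be sketched in a few lines. This is an illustrative implementation, assuming an even-length input vector; note the output has the same length as the input, as required above:

```python
import math

def haar_dwt(x):
    """One level of the Haar DWT; len(x) must be even."""
    approx = [(a + b) / math.sqrt(2) for a, b in zip(x[::2], x[1::2])]
    detail = [(a - b) / math.sqrt(2) for a, b in zip(x[::2], x[1::2])]
    return approx + detail

def haar_idwt(c):
    """Inverse of one Haar level: reconstructs the original vector."""
    half = len(c) // 2
    out = []
    for s, d in zip(c[:half], c[half:]):
        out.append((s + d) / math.sqrt(2))
        out.append((s - d) / math.sqrt(2))
    return out

x = [4.0, 6.0, 10.0, 12.0]
coeffs = haar_dwt(x)
# For data reduction, small detail coefficients (the second half) could be
# truncated to zero; keeping them all makes the reconstruction exact.
```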
Principal Component Analysis (PCA)
Covariance Matrix: Calculate the covariance matrix to understand how variables are related.
Eigenvectors and Eigenvalues: Identify the principal components (eigenvectors) and the amount of variance they capture (eigenvalues).
Project Data: Transform the original data into the new principal components.
Applications:
● Data compression: Reduce the number of features while keeping essential information.
● Visualization: PCA can reduce complex datasets (e.g., 10 features) to 2 or 3 components for easy visualization.
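The three steps above can be sketched for the two-variable case, where the 2×2 covariance matrix has a closed-form eigendecomposition. The sample values are hypothetical:

```python
import math

# Hypothetical two-variable sample.
xs = [2.5, 0.5, 2.2, 1.9, 3.1]
ys = [2.4, 0.7, 2.9, 2.2, 3.0]

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance with the n-1 denominator."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

# Step 1: covariance matrix [[sxx, sxy], [sxy, syy]].
sxx, syy, sxy = cov(xs, xs), cov(ys, ys), cov(xs, ys)

# Step 2: larger eigenvalue of the symmetric 2x2 matrix (the variance
# captured by the first principal component) and its unit eigenvector.
lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
vx, vy = sxy, lam - sxx           # solves (M - lam*I) v = 0 when sxy != 0
norm = math.hypot(vx, vy)
vx, vy = vx / norm, vy / norm

# Step 3: project the centered data onto the first component (1-D output).
mx, my = mean(xs), mean(ys)
projected = [(x - mx) * vx + (y - my) * vy for x, y in zip(xs, ys)]
```

The variance of `projected` equals the eigenvalue `lam`, which is exactly what "the amount of variance captured" means in the steps above.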
Numerosity Reduction
● It is the technique to replace the original data by alternative smaller forms of data representation.
Types:
1. Parametric
2. Non-Parametric
1. Parametric
This method assumes a model into which the data fits. Data model parameters are estimated, and only those parameters need to be stored instead of the actual data.
1. Regression
2. Log-Linear Model
Regression: Regression can be a simple linear regression or multiple linear regression. When there is only a
single independent attribute, such a regression model is called simple linear regression. If there are
multiple independent attributes, then such regression models are called multiple linear regression.
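A sketch of simple linear regression used as parametric numerosity reduction: only the slope w and intercept b are stored in place of the raw data. The observations below are hypothetical:

```python
# Hypothetical (x, y) observations that are roughly linear.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Least-squares estimates of slope and intercept.
w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     / sum((x - mx) ** 2 for x in xs))
b = my - w * mx

def predict(x):
    """The stored (w, b) model replaces the raw data for estimation."""
    return w * x + b

print(round(w, 3), round(b, 3))  # roughly 1.99 and 0.09
```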
Log-Linear Model: The log-linear model discovers the relationship between two or more discrete attributes and estimates the probability of each point in a multidimensional space from a smaller set of lower-dimensional combinations.
2. Non-Parametric
A non-parametric numerosity reduction technique does not assume any
model.
1. Histogram
2. Clustering
3. Sampling
Bottom-up Discretization -
✔Starts by considering all of the continuous values as potential split-points.
✔Removes some by merging neighborhood values to form intervals, and then recursively applies this process to the resulting
intervals.
Concept Hierarchies
✔Discretization can be performed rapidly on an attribute to provide a hierarchical partitioning of the attribute values,
known as a Concept Hierarchy.
✔Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts with higher-level
concepts.
✔In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple
levels of abstraction defined by concept hierarchies.
✔This organization provides users with the flexibility to view data from different perspectives.
✔Data mining on a reduced data set means fewer input and output operations and is more efficient than mining on a
larger data set.
✔Because of these benefits, discretization techniques and concept hierarchies are typically applied before data
mining, rather than during mining.
Discretization and Concept Hierarchy Generation for Numerical Data
1. Binning
2. Histogram Analysis
3. Cluster Analysis
1] Binning
❑Binning is a top-down splitting technique based on a specified number of bins.
❑Binning is an unsupervised discretization technique because it does not use class information.
❑The sorted values are distributed into a number of buckets, or bins, and each bin value is then replaced by the bin mean or median.
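The equal-depth variant of this smoothing can be sketched as below. The price list is a hypothetical example, and the helper assumes the number of values divides evenly into the bins:

```python
# Hypothetical sorted prices.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

def smooth_by_bin_means(values, n_bins):
    """Equal-depth binning; assumes len(values) is a multiple of n_bins."""
    values = sorted(values)
    depth = len(values) // n_bins
    smoothed = []
    for i in range(0, len(values), depth):
        bin_ = values[i:i + depth]
        bin_mean = sum(bin_) / len(bin_)
        smoothed.extend([bin_mean] * len(bin_))
    return smoothed

print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```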
2] Histogram Analysis
✔It is an unsupervised discretization technique because histogram analysis does not use class information.
✔Histograms partition the values for an attribute into disjoint ranges called buckets.
It is also further classified into
Equal-width histogram
Equal frequency histogram
The histogram analysis algorithm can be applied recursively to each partition to automatically generate a multilevel concept hierarchy,
with the procedure terminating once a pre-specified number of concept levels has been reached.
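An equal-width partition of an attribute into disjoint buckets can be sketched as follows; the helper name and sample data are illustrative:

```python
def equal_width_buckets(values, n_buckets):
    """Count how many values fall into each of n equal-width buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        idx = min(int((v - lo) / width), n_buckets - 1)  # clamp the maximum
        counts[idx] += 1
    return counts

data = [1, 2, 2, 3, 5, 6, 8, 9, 10]
print(equal_width_buckets(data, 3))  # buckets [1,4), [4,7), [7,10] -> [4, 2, 3]
```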
3] Cluster Analysis
✔Cluster analysis is a popular data discretization method.
✔A clustering algorithm can be applied to discretize a numerical attribute of A by partitioning the values of A into clusters or groups.
✔Clustering considers the distribution of A, as well as the closeness of data points, and therefore can produce high-quality discretization
results.
✔Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy.
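As a sketch of cluster-based discretization, a tiny 1-D k-means (illustrative only, not a production algorithm) can group an attribute's values; the midpoint between adjacent cluster centers then serves as an interval cut point:

```python
def kmeans_1d(values, k=2, iters=20):
    """Tiny 1-D k-means: returns the final centers and value clusters."""
    centers = sorted(values)[:: max(1, len(values) // k)][:k]  # spread seeds
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], k=2)
# Midpoint between adjacent centers = a natural discretization cut point.
cut_point = (min(centers) + max(centers)) / 2
print(centers, cut_point)  # [2.0, 11.0] 6.5
```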
4. Discretization by Intuitive Partitioning
✔Numerical ranges are partitioned into relatively uniform, easy-to-read intervals that appear intuitive or “natural.”
✔The 3-4-5 rule can be used to segment numerical data into relatively uniform, natural-seeming intervals.
✔In general, the rule partitions a given range of data into 3, 4, or 5 relatively equal-width intervals, recursively and level
by level, based on the value range at the most significant digit.
The rule is as follows:
✔If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, then partition the range into 3 equal-width intervals.
✔If it covers 2, 4, or 8 distinct values at the most significant digit, then partition the range into 4 equal-width intervals.
✔If it covers 1, 5, or 10 distinct values at the most significant digit, then partition the range into 5 equal-width intervals.
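The choice of 3, 4, or 5 intervals above can be sketched as a lookup on the distinct-value count at the most significant digit. This helper assumes a positive integer range, and its names are illustrative:

```python
def n_intervals_345(low, high):
    """Number of intervals chosen by the 3-4-5 rule for an integer range."""
    span = high - low
    msd_unit = 10 ** (len(str(span)) - 1)  # assumes span is a positive integer
    distinct = round(span / msd_unit)      # distinct values at the msd
    if distinct in (3, 6, 7, 9):
        return 3
    if distinct in (2, 4, 8):
        return 4
    if distinct in (1, 5, 10):
        return 5
    return 3  # fallback for counts outside the rule's table

def partition_345(low, high):
    """Split [low, high] into the equal-width intervals chosen above."""
    n = n_intervals_345(low, high)
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]

print(partition_345(0, 1000))  # five $200-wide intervals
```

For a range of $0–$1000, the span covers 1 distinct value at the most significant digit, so the rule yields 5 intervals; for $0–$900 it covers 9, yielding 3 intervals.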
Concept Hierarchy Generation for Nominal Data(Categorical Data )
Categorical data are discrete data.
• Categorical attributes have a finite (but possibly large) number of distinct values, with no ordering among the values.
• Examples include geographic location, job category, and item type.
i) Specification of a partial ordering of attributes explicitly at the schema level by users or experts