
Data Mining

Unit-1
Syllabus

Data Mining: Data – Types of Data – Data Mining Functionalities –
Interestingness Patterns – Classification of Data Mining systems – Data Mining
Task Primitives – Integration of a Data Mining system with a Data Warehouse –
Major Issues in Data Mining – Data Preprocessing.
What is Data Mining?

✔Extracting information that is previously unknown to the user from large collections of data.
Characteristics of Data Mining:
Non-Trivial: the extraction is more than simple retrieval; relevant patterns must be discovered from the data.
Novel: the discovered patterns should be new to the user, not already known.
Useful: the information retrieved should be useful for decision making.

Data mining is used in business to make better managerial decisions by:


❑Automatic summarization of data.
❑Extracting the essence of the information stored.
❑Discovering patterns in raw data.
KDD Process in Data Mining- Knowledge Discovery in Databases
1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
✔Cleaning in case of Missing values.
✔Cleaning noisy data, where noise is a random or variance error.
✔Cleaning with Data discrepancy detection and Data transformation tools.
2. Data Integration: Data integration is defined as combining heterogeneous data from multiple sources into a common
source (a data warehouse).
✔Data integration using Data Migration tools.
✔Data integration using Data Synchronization tools.
✔Data integration using the ETL (Extract-Transform-Load) process.
3. Data Selection: Data selection is defined as the process where data relevant to the analysis is decided and retrieved from
the data collection.
✔Data selection using Neural networks.
✔Data selection using Decision trees.
✔Data selection using Naive Bayes.
✔Data selection using Clustering, Regression, etc.
4. Data Transformation: Data Transformation is defined as the process of transforming data into the appropriate form required by
the mining procedure.
5. Data Mining: Data mining is defined as the application of clever techniques to extract potentially useful patterns.
✔Transforms task relevant data into patterns.
✔Decides purpose of model using classification or characterization.
6. Pattern Evaluation: Pattern Evaluation is defined as identifying the truly interesting patterns representing knowledge, based on
given interestingness measures.
✔Find interestingness score of each pattern.
✔Uses summarization and Visualization to make data understandable by user.
7. Knowledge representation: Knowledge representation is defined as the technique which utilizes visualization tools to represent
data mining results.
✔Generate reports.
✔Generate tables.
✔Generate discriminant rules, classification rules, characterization rules, etc.
Note: KDD is an iterative process where evaluation measures can be enhanced, mining can be refined, and new data can be
integrated and transformed in order to get different and more appropriate results.
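
As a minimal illustration of how these seven steps chain together, the sketch below runs a toy KDD pipeline in Python, assuming pandas and scikit-learn are available; the tables, column names, and the choice of clustering as the mining step are all hypothetical.

import pandas as pd
from sklearn.cluster import KMeans

# 1. Data cleaning: remove missing and duplicate records.
sales = pd.DataFrame({"cust_id": [1, 2, 2, 3, 4],
                      "amount": [120.0, None, 85.0, 40.0, 300.0]})
sales = sales.dropna().drop_duplicates()

# 2. Data integration: combine data from multiple sources into one table.
profiles = pd.DataFrame({"cust_id": [1, 2, 3, 4],
                         "age": [25, 40, 31, 58]})
data = sales.merge(profiles, on="cust_id")

# 3. Data selection: keep only the task-relevant attributes.
relevant = data[["age", "amount"]]

# 4. Data transformation: scale attributes into a comparable form.
transformed = (relevant - relevant.mean()) / relevant.std()

# 5. Data mining: apply a technique (here, clustering) to extract patterns.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(transformed)

# 6./7. Pattern evaluation and knowledge representation: summarize the clusters.
data["cluster"] = labels
print(data.groupby("cluster")[["age", "amount"]].mean())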
Types of Data
Data: data is information that has been translated into a form that is efficient for movement or
processing.

The most basic forms of data for mining applications are:


1. Database Data
2. Data Warehouse Data
3. Transactional Data
4. Other Kinds Of Data
1. Database Data

❖DBMS (database management system): contains a collection of interrelated databases.


Ex: Faculty database, student database, publications database
❖Each database contains a collection of tables and functions to manage and access the
data.
Ex: student_bio, student_parking
❖Each table contains columns and rows, with columns as attributes of data and rows as
records.
❖Tables can also be used to represent the relationships between or among multiple entities.
Example
Through the use of relational queries, you can ask things like, “Show me a list of all items that were sold in the last quarter.”
Relational languages also use aggregate functions such as sum, avg (average), count, max (maximum), and min (minimum).
Using aggregates allows you to ask:
1. “Show me the total sales of the last month, grouped by branch,” or
2. “How many sales transactions occurred in the month of December?”
3. or “Which salesperson had the highest sales?”
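
To make these aggregate queries concrete, here is a hedged sketch using Python's built-in sqlite3 module; the sales table, its columns, and the sample rows are hypothetical.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (branch TEXT, salesperson TEXT, month TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                [("North", "Ana", "December", 120.0),
                 ("North", "Raj", "December", 80.0),
                 ("South", "Ana", "November", 200.0)])

# 1. Total sales of the last month, grouped by branch.
for row in con.execute("SELECT branch, SUM(amount) FROM sales "
                       "WHERE month = 'December' GROUP BY branch"):
    print(row)

# 2. How many sales transactions occurred in December?
print(con.execute("SELECT COUNT(*) FROM sales WHERE month = 'December'").fetchone())

# 3. Which salesperson had the highest sales?
print(con.execute("SELECT salesperson, SUM(amount) AS total FROM sales "
                  "GROUP BY salesperson ORDER BY total DESC LIMIT 1").fetchone())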

When mining relational databases, we can go further by searching for trends or data
patterns.
For example, data mining systems can analyze customer data to predict the credit risk of new customers based on their
income, age, and previous credit information.
Data mining systems may also detect deviations—that is, items with sales that are far from those expected in comparison
with the previous year.
2. Data Warehouse Data
✔A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a
single site.
✔Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data
refreshing.
✔A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which each dimension corresponds to an
attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure such as count or sum(sales
amount). A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.
✔Data are organized around major subjects
e.g. customer, item, supplier and activity.
✔Provide information from a historical perspective
e.g. from the past 5 – 10 years
✔Typically summarized to a higher level
e.g. a summary of the transactions per item type for each store
✔User can perform drill-down or roll-up operation to view the data at
different degrees of summarization.
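
As a rough sketch of the data cube idea, the following Python example (assuming pandas) builds a tiny cube over hypothetical store, item_type, and quarter dimensions, and shows roll-up and drill-down as aggregation at different granularities.

import pandas as pd

cube = pd.DataFrame({"store":     ["S1", "S1", "S2", "S2", "S1"],
                     "item_type": ["TV", "PC", "TV", "PC", "TV"],
                     "quarter":   ["Q1", "Q1", "Q1", "Q2", "Q2"],
                     "sales":     [100, 150, 90, 200, 120]})

# Base cuboid: one cell per (store, item_type, quarter) combination.
print(cube.pivot_table(values="sales", index="store",
                       columns=["item_type", "quarter"], aggfunc="sum"))

# Roll-up: climb to a higher level by summarizing away item_type.
print(cube.groupby(["store", "quarter"])["sales"].sum())

# Drill-down: move back to the finer-grained (store, quarter, item_type) view.
print(cube.groupby(["store", "quarter", "item_type"])["sales"].sum())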
3. Transactional Data
✔In general, each record in a transactional database captures a transaction, such as a customer’s purchase, a flight
booking, or a user’s clicks on a web page.
✔A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the
transaction, such as the items purchased in the transaction.
✔A transactional database may have additional tables, which contain other information related to the transactions,
such as item description, information about the salesperson or the branch, and so on.
4. Other Kinds of Data
✔Time-related or sequence data (e.g., historical records, stock exchange data, and time-series and biological sequence data),

✔Data streams (e.g., video surveillance and sensor data, which are continuously transmitted),

✔Spatial data (e.g., maps),

✔Engineering design data (e.g., the design of buildings, system components, or integrated circuits),

✔Hypertext and multimedia data (including text, image, video, and audio data),

✔Graph and networked data (e.g., social and information networks),

✔The Web (a huge, widely distributed information repository made available by the Internet).
Data Mining Tasks:

Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters

A typical DM System Architecture

Data Sources: Database, World Wide Web (WWW), and data warehouse are parts of the data sources. The data in these sources
may be in the form of plain text, spreadsheets, or other forms of media like photos or videos. The WWW is one of the biggest sources
of data.
Database Server: The database server contains the actual data ready to be processed. It performs the task of handling data
retrieval as per the request of the user.
Data Mining Engine: It is one of the core components of the data mining architecture that performs all kinds of data mining
techniques like association, classification, characterization, clustering, prediction, etc.
Pattern Evaluation Modules: They are responsible for finding interesting patterns in the data and sometimes they also
interact with the database servers for producing the result of the user requests.
Graphical User Interface: Since the user cannot fully grasp the complexity of the data mining process, the graphical
user interface helps the user communicate effectively with the data mining system.
Knowledge Base: Knowledge Base is an important part of the data mining engine that is quite beneficial in guiding the
search for the result patterns. Data mining engines may also sometimes get inputs from the knowledge base. This knowledge
base may contain data from user experiences. The objective of the knowledge base is to make the result more accurate and
reliable.
Data Mining Functionalities
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.
1. Descriptive mining
2. Predictive mining

1. Descriptive Data Mining is a data mining technique that identifies what happened in the past by analyzing the stored past data.
Ex: research, business, economics, social sciences, and healthcare.

2. Predictive Data Mining is the analysis done to predict a future event or other data or trends.
Ex:
✔Predicting employee growth in HR.
✔Predicting performance in sports.
✔Forecasting patterns in weather.
✔Fraud detection.
Data Mining Functionalities
✔Classification: assigns data into different classes.
✔Clustering & Anomaly Detection (Outlier Change Detection): clustering groups a set of objects so that objects in the
same group are more similar to each other than those in other groups & identifies unusual data records.
✔Regression: predicts a range of numeric values based on a continuous dataset
✔Association Rules: discovering interesting relations between variables in large databases
✔Decision Trees: a model that uses a tree-like graph of decisions and their possible consequences.
✔Neural Networks: neural networks are a series of algorithms that attempt to recognise underlying relationships in a
data set.
✔Data Visualization: Turning complex data sets into graphical representations that are easy to understand and interpret.
✔Text Mining: Utilizing techniques to extract qualitative information from text data sources.
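
The short sketch below contrasts three of these functionalities (classification, regression, and clustering) on synthetic data, assuming scikit-learn is available; it is an illustration, not a prescribed implementation.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])                 # class labels
y_reg = np.array([1.1, 2.0, 2.9, 10.2, 11.1, 11.8])   # continuous target

# Classification: assign data to predefined classes (here, a decision tree).
print(DecisionTreeClassifier().fit(X, y_class).predict([[2.5]]))

# Regression: predict a numeric value from a continuous dataset.
print(LinearRegression().fit(X, y_reg).predict([[5.0]]))

# Clustering: group similar objects without predefined labels.
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))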
Interestingness Patterns
A data mining system has the potential to generate thousands or even millions of patterns, or rules. This raises the
question: "Are all of the patterns interesting?" Typically not; only a small fraction of the patterns potentially generated
would be of interest to any given user.

What makes a pattern interesting? A pattern is interesting if it is easily understood by humans, valid on new or test data
with some degree of certainty, potentially useful, and novel.

Can a data mining system generate all the interesting patterns? This refers to the completeness of a data mining algorithm.

Can a data mining system generate only interesting patterns? It is highly desirable for data mining systems to generate

only interesting patterns. An interesting pattern represents knowledge.


Classification of Data Mining Systems

Classification of the data mining system helps users to understand the system and match their
requirements with such systems.
Data mining systems can be categorized according to various criteria, as follows:
i) Classification according to the kinds of databases mined:
✔Database systems can be classified according to data models; accordingly, we may have a relational,
transactional, object-relational, or data warehouse mining system.
✔Each of which may require its own data mining technique.
ii) Classification according to the kinds of knowledge mined:
✔Data mining systems can be categorized according to the kinds of knowledge they mine, that is,
based on data mining functionalities, such as characterization, discrimination, association and
correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis.
iii) Classification according to the kinds of techniques utilized:
✔Data mining systems can be categorized according to the underlying data mining techniques
employed.
✔These techniques can be described according to the degree of user interaction involved (e.g.,
autonomous systems, interactive exploratory systems, query-driven systems).
iv) Classification according to the applications adapted:

✔Data mining systems can also be categorized according to the applications they are adapted to.

✔For example, data mining systems may be tailored specifically for finance,

telecommunications, DNA, stock markets, e-mail, and so on.


Data Mining Task primitives
✔A data mining task can be specified in the form of
a data mining query, which is input to the data
mining system.
✔A data mining query is defined in terms of data
mining task primitives.
✔These primitives allow the user to interactively
communicate with the data mining system during
the mining process to discover interesting
patterns.
Set of task relevant data to be mined
This specifies the portions of the database or the set of data in which the user is interested.
This portion includes the following
✔Database Attributes
✔Data Warehouse dimensions of interest
For example, suppose that you are a manager of All Electronics in charge of sales in the United States and Canada. You would like to
study the buying trends of customers in Canada. Rather than mining the entire database, you can specify only the portions of the data that are of interest; the attributes involved are referred to as relevant attributes.
Kind of knowledge to be mined
This specifies the data mining functions to be performed, such as Characterization& Discrimination
✔Association
✔Classification
✔Clustering
✔Prediction
✔Outlier analysis
For instance, if studying the buying habits of customers in Canada, you may choose to mine associations between customer profiles
and the items that these customers like to buy.
Background knowledge to be used in discovery process: Users can specify background knowledge, or knowledge
about the domain to be mined. This knowledge is useful for guiding the knowledge discovery process and for evaluating
the patterns found, and may include user beliefs about relationships in the data.
✔Concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of
abstraction.
Interestingness measures and thresholds for pattern evaluation

✔The interestingness measures are used to separate interesting patterns from uninteresting ones.

✔They may be used to guide the mining process, or after discovery, to evaluate the discovered patterns. Different kinds

of knowledge may have different interestingness measures.

For example, interestingness measures for association rules include support and confidence.
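
As a minimal illustration, the following Python sketch computes support and confidence for a rule A => B over a hypothetical list of transactions.

transactions = [{"milk", "bread"}, {"milk", "bread", "butter"},
                {"bread"}, {"milk", "butter"}, {"milk", "bread"}]

def support(itemset, db):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    # P(consequent | antecedent) = support(A u B) / support(A).
    return support(antecedent | consequent, db) / support(antecedent, db)

# Rule: milk => bread
print(support({"milk", "bread"}, transactions))       # 0.6
print(confidence({"milk"}, {"bread"}, transactions))  # 0.75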

Representation for visualizing the discovered patterns

✔This refers to the form in which discovered patterns are to be displayed. Users can choose from different forms for

knowledge presentation, such as rules, tables, reports, charts, graphs, decision trees, and cubes.


Integration of Data mining system with a Data warehouse
The data mining system is integrated with a database or data warehouse system so that it can perform its tasks effectively. A data
mining system operates in an environment that needs to communicate with other data systems, such as a database or data warehouse
system.
No Coupling
❑No coupling means that a Data Mining system will not utilize any function of a Data Base or Data Warehouse

system.

❑It may fetch data from a particular source (such as a file system), process data using some data mining algorithms,

and then store the mining results in another file.

Drawbacks of No Coupling

❖First, without using a Database/Data Warehouse system, a Data Mining system may spend a substantial amount of

time finding, collecting, cleaning, and transforming data.

❖Second, there are many tested, scalable algorithms and data structures implemented in Database and Data

Warehouse systems, and a no-coupling design cannot take advantage of them.
Loose Coupling
❑In this Loose coupling, the data mining system uses some facilities / services of a database or data warehouse

system. The data is fetched from a data repository managed by these (DB/DW) systems.

❑Data mining approaches are used to process the data and then the processed data is saved either in a file or in a

designated area in a database or data warehouse.

❑Loose coupling is better than no coupling because it can fetch any portion of data stored in Databases or Data

Warehouses by using query processing, indexing, and other system facilities.

Drawbacks of Loose Coupling

❑It is difficult for loose coupling to achieve high scalability and good performance with large data sets.
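
A hedged illustration of loose coupling in Python: the mining system uses the database system's query processing (here, the built-in sqlite3 module) to fetch only the relevant portion of the data, then performs the analysis itself. The table, columns, and statistics are hypothetical.

import sqlite3
from statistics import mean

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (region TEXT, age INTEGER, spend REAL)")
con.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [("Canada", 34, 250.0), ("Canada", 51, 410.0),
                 ("US", 29, 120.0), ("Canada", 42, 305.0)])

# The DB system's query processing fetches just the relevant portion.
rows = con.execute("SELECT age, spend FROM customers "
                   "WHERE region = 'Canada'").fetchall()

# The mining step itself runs in the data mining system, not the database.
print("avg age:", mean(r[0] for r in rows))
print("avg spend:", mean(r[1] for r in rows))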
Semi-Tight Coupling
✔Semi tight coupling means that besides linking a Data Mining system to a Data Base/Data Warehouse system, efficient
implementations of a few essential data mining primitives can be provided in the DB/DW system.
✔These primitives can include sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation of
some essential statistical measures, such as sum, count, max, min, and standard deviation.
Advantage of Semi-Tight Coupling
This coupling will enhance the performance of Data Mining systems.

Tight Coupling
Tight coupling means that a Data Mining system is smoothly integrated into the Data Base/Data Warehouse system.
The data mining subsystem is treated as one functional component of the information system. Data mining queries and
functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing
methods of a DB or DW system.
Major Issues in Data Mining
Data Reduction
✔Data reduction is a process used in data processing and analysis to reduce the amount of data without significantly affecting its integrity

or quality. The goal is to simplify or compress the dataset to make it easier to store, process, and analyze while retaining the essential

information.

✔Mining on the reduced data set should be more efficient, yet produce the same (or almost the same) analytical results.

Ex: Image Processing

Techniques in data Reduction:

1. Data Compression

2. Dimensionality Reduction

3. Numerosity reduction
Dimensionality Reduction
->The number of input features, variables, or columns present in a given dataset is known as
dimensionality, and the process to reduce these features is called dimensionality reduction.
->A dimensionality reduction technique can be defined as "a way of converting a higher-
dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar
information."
Ex: speech recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
Wavelet Transformation
● The signal is represented by wavelets, which are small, oscillating functions that capture both time and

frequency information.

● The wavelet transform decomposes a signal into a set of basis functions; these basis functions are called wavelets.

● Wavelet Transformation works on both positive and negative areas.

● Types of wavelet transforms

1. Continuous wavelet transforms

2. Discrete wavelet transforms


Discrete Wavelet Transform
● The discrete wavelet transform (DWT) is a linear signal processing technique.

● The data vector X is transformed into a numerically different vector, Xo, of wavelet coefficients when the

DWT is applied.

● The two vectors X and Xo must be of the same length. When applying this technique to data reduction, we

consider n-dimensional data tuple, that is, X = (x1,x2,…,xn), where n is the number of attributes present in

the relation of the data set.


Pyramid Algorithm
1. The input data vector is of length L, where L is an integer power of 2. If L is not a power of 2,
zeroes are appended at the end of the input data vector to make its length a power of 2.
2. We apply two functions for each transform of the data vector. The first function performs data
smoothing, such as finding the weighted average of the data values. The second function finds the
weighted difference, which retrieves the important features of the input vector.
3. We apply the two functions to pairs of data points (x2i, x2i+1). Two different data sets of
length L/2 are obtained after applying the two functions. The first data set is the low-frequency version
of the original data and the second one is the high-frequency data set.
4. These two functions are applied to the data vectors recursively until the obtained resultant data vectors
are of length 2.
5. Finally, selected values from the data sets obtained in the above iterations are designated as the wavelet coefficients of the transformed data.
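
As a concrete, simplified instance of this procedure, the sketch below implements the pyramid algorithm with the unnormalized Haar wavelet, where the smoothing function is the pairwise average and the weighted difference is the pairwise half-difference; other wavelets and normalizations exist.

def haar_pyramid(x):
    # Step 1: pad with zeros so the length is a power of 2.
    n = 1
    while n < len(x):
        n *= 2
    x = list(x) + [0.0] * (n - len(x))

    coeffs = []
    # Steps 2-4: apply smoothing and differencing to pairs (x_2i, x_2i+1),
    # then recurse on the length-L/2 low-frequency data set.
    while len(x) > 1:
        smooth = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
        detail = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
        coeffs = detail + coeffs    # keep the high-frequency coefficients
        x = smooth
    return x + coeffs               # overall average plus wavelet coefficients

print(haar_pyramid([2.0, 4.0, 6.0, 8.0]))  # [5.0, -2.0, -1.0, -1.0]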
Principal Component Analysis
1. Principal Component Analysis (PCA) is a data reduction technique used to simplify large datasets by reducing the number of variables
(features) while preserving as much information as possible.
2. It does this by transforming the original variables into new, uncorrelated variables called principal components.
Steps of PCA:

Standardization: Ensure the data has a mean of 0 and standard deviation of 1.

Covariance Matrix: Calculate the covariance matrix to understand how variables are related.

Eigenvectors and Eigenvalues: Identify the principal components (eigenvectors) and the amount of variance they capture (eigenvalues).

Project Data: Transform the original data into the new principal components.

Applications:

● Data compression: Reduce the number of features while keeping essential information.

● Visualization: PCA can reduce complex datasets (e.g., 10 features) to 2 or 3 components for easy visualization.
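
The following numpy sketch walks through the four steps listed above on toy data; the data values are hypothetical, and the eigendecomposition route shown is one of several ways to compute PCA.

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# 1. Standardization: mean 0 and standard deviation 1 per feature.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features.
C = np.cov(Z, rowvar=False)

# 3. Eigenvectors (principal components) and eigenvalues (variance captured).
eigvals, eigvecs = np.linalg.eigh(C)    # eigh: C is symmetric
order = np.argsort(eigvals)[::-1]       # sort by variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project the data onto the first principal component (k = 1).
projected = Z @ eigvecs[:, :1]
print(eigvals)      # variance captured by each component
print(projected)    # the reduced, one-dimensional representation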
Numerosity Reduction
● It is a technique that replaces the original data with alternative, smaller forms of data representation.
Types:
1. Parametric
2. Non-Parametric
1. Parametric
This method assumes a model into which the data fits. The model parameters are estimated; only those

parameters are stored, and the rest of the data is discarded.

1. Regression

2. Log-linear models

Regression: Regression can be a simple linear regression or multiple linear regression. When there is only a

single independent attribute, such a regression model is called simple linear regression. If there are

multiple independent attributes, then such regression models are called multiple linear regression.

Log-Linear Model: The log-linear model discovers the relationship between two or more discrete attributes.
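
As a minimal sketch of the parametric idea, the Python example below fits a simple linear regression and keeps only the two fitted parameters as the reduced representation; the data points are hypothetical.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Fit y ~ slope*x + intercept by least squares;
# np.polyfit returns the coefficients highest degree first.
slope, intercept = np.polyfit(x, y, deg=1)

# The reduced representation is just these two parameters.
print(slope, intercept)

# Any original value can be approximated on demand from the model.
print(slope * 3.0 + intercept)   # approximation of y at x = 3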
2. Non-Parametric
A non-parametric numerosity reduction technique does not assume any

model.

1. Histogram

2. Clustering

3. Sampling

4. Data Cube Aggregation


✔A histogram is a representation of the data in terms of frequency.
✔A histogram of attribute A partitions the data into disjoint subsets, referred to as buckets or bins.


1. Singleton buckets: each bucket represents a single attribute value/frequency pair.
2. Equal-width buckets: each bucket covers a value range of uniform width.
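
The sketch below illustrates both bucket types on a hypothetical price attribute, assuming numpy is available: collections.Counter yields singleton buckets (one per distinct value), while numpy.histogram yields equal-width buckets.

from collections import Counter
import numpy as np

prices = [1, 1, 5, 5, 5, 8, 10, 10, 12, 14, 14, 14, 15, 18, 21, 21, 25, 28, 28, 30]

# Singleton buckets: one bucket per distinct value (value -> frequency).
print(Counter(prices))

# Equal-width buckets: partition the value range into bins of uniform width.
counts, edges = np.histogram(prices, bins=3)
print(counts)   # frequency per bucket
print(edges)    # bucket boundaries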
Data Discretization and Concept Hierarchy

✔A method of converting a huge number of data values into smaller ones so that the evaluation and
management of data become easy.

✔Data discretization is a method of converting the attribute values of continuous data into a finite set
of intervals with minimum data loss.


Supervised discretization refers to a method in which the class data is used.
Unsupervised discretization refers to a method in which no class information is used.
Discretization can also be categorized by the direction in which it proceeds: top-down or bottom-up.
Top-down Discretization -
✔The process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and
then repeats this recursively on the resulting intervals.

Bottom-up Discretization -
✔Starts by considering all of the continuous values as potential split-points.
✔Removes some by merging neighborhood values to form intervals, and then recursively applies this process to the resulting
intervals.
Concept Hierarchies
✔Discretization can be performed rapidly on an attribute to provide a hierarchical partitioning of the attribute values,
known as a Concept Hierarchy.
✔Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts with higher-level
concepts.
✔In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple
levels of abstraction defined by concept hierarchies.
✔This organization provides users with the flexibility to view data from different perspectives.
✔Data mining on a reduced data set means fewer input and output operations and is more efficient than mining on a
larger data set.
✔Because of these benefits, discretization techniques and concept hierarchies are typically applied before data
mining, rather than during mining.
Discretization and Concept Hierarchy Generation for Numerical Data
1. Binning

2. Histogram Analysis

3. Cluster Analysis

4. Discretization by Intuitive Partitioning

1] Binning
❑Binning is a top-down splitting technique based on a specified number of bins.
❑Binning is an unsupervised discretization technique because it does not use class information.
❑The sorted values are distributed into a number of buckets or bins, and each bin value is then replaced by the bin mean or
median.
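
As a minimal sketch of binning, the following Python example sorts hypothetical values into three equal-frequency bins and smooths by bin means, as described above.

data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
n_bins = 3
size = len(data) // n_bins

bins = [data[i * size:(i + 1) * size] for i in range(n_bins)]
# Smoothing by bin means: replace each value with its bin's mean.
smoothed = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]

print(bins)      # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smoothed)  # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]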
2] Histogram Analysis
✔It is an unsupervised discretization technique because histogram analysis does not use class information.
✔Histograms partition the values for an attribute into disjoint ranges called buckets.
It is also further classified into
Equal-width histogram
Equal frequency histogram
The histogram analysis algorithm can be applied recursively to each partition to automatically generate a multilevel concept hierarchy,
with the procedure terminating once a pre-specified number of concept levels has been reached.
3] Cluster Analysis
✔Cluster analysis is a popular data discretization method.
✔A clustering algorithm can be applied to discretize a numerical attribute, A, by partitioning the values of A into clusters or groups.
✔Clustering considers the distribution of A, as well as the closeness of data points, and therefore can produce high-quality discretization
results.
✔Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy.
4. Discretization by Intuitive Partitioning

✔Numerical ranges are partitioned into relatively uniform, easy-to-read intervals that appear intuitive or "natural."
✔The 3-4-5 rule can be used to segment numerical data into relatively uniform, natural-seeming intervals.
✔In general, the rule partitions a given range of data into 3, 4, or 5 relatively equal-width intervals, recursively and level
by level, based on the value range at the most significant digit.
The rule is as follows:
✔If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, then partition the range into 3 intervals
✔If it covers 2, 4, or 8 distinct values at the most significant digit, then partition the range into 4 equal-width intervals.
✔If it covers 1, 5, or 10 distinct values at the most significant digit, then partition the range into 5 equal-width intervals.
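
A simplified, single-level sketch of the 3-4-5 rule as stated above (the recursive, level-by-level refinement is omitted); the ranges passed in are hypothetical.

import math

def three_four_five(low, high):
    rng = high - low
    magnitude = 10 ** math.floor(math.log10(rng))  # most significant digit position
    distinct = round(rng / magnitude)              # distinct values at that digit
    if distinct in (3, 6, 7, 9):
        n = 3
    elif distinct in (2, 4, 8):
        n = 4
    else:                                          # 1, 5, or 10 distinct values
        n = 5
    width = rng / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]

print(three_four_five(0, 1000))   # 1 distinct value at the MSD  -> 5 intervals
print(three_four_five(0, 600))    # 6 distinct values at the MSD -> 3 intervals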
Concept Hierarchy Generation for Nominal Data (Categorical Data)
Categorical data are discrete data.
• Categorical attributes have a finite (but possibly large) number of distinct values, with no ordering among the values.
• Examples include geographic location, job category, and item type.

i) Specification of a partial ordering of attributes explicitly at the schema level by users or experts

ii) Specification of a portion of a hierarchy by explicit data grouping

iii) Specification of a set of attributes, but not of their partial ordering

iv) Specification of only a partial set of attributes


i) Specification of a partial ordering of attributes explicitly at the schema level
by users or experts
❑Concept hierarchies for nominal attributes or dimensions
typically involve a group of attributes.
❑A user or expert can easily define a concept hierarchy by
specifying a partial or total ordering of the attributes at the
schema level.
❑For example, suppose that a relational database contains the
following group of attributes: street, city, province or state, and
country
❑ A hierarchy can be defined by specifying the total ordering
among these attributes at the schema level such as street <
city < province or state < country.
❑An example of a partial order for the time dimension, based on
the attributes day, week, month, quarter, and year, is "day
< {month < quarter; week} < year".
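
As a small illustration of such a schema-level hierarchy, the Python sketch below encodes the ordering street < city < province_or_state < country and rolls a hypothetical location record up to coarser levels.

hierarchy = ["street", "city", "province_or_state", "country"]

location = {"street": "12 Main St", "city": "Vancouver",
            "province_or_state": "British Columbia", "country": "Canada"}

def roll_up(record, from_level, to_level):
    # Replace a low-level concept with the higher-level concept above it.
    assert hierarchy.index(to_level) > hierarchy.index(from_level)
    return record[to_level]

print(roll_up(location, "street", "city"))    # 'Vancouver'
print(roll_up(location, "city", "country"))   # 'Canada'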
ii) Specification of a portion of a hierarchy by explicit data grouping
✔Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or
attribute, resulting in a set-grouping hierarchy. A total or partial order can be defined among
groups of values.
✔Concept hierarchies may be provided manually by system users
✔An example of a set-grouping hierarchy is shown in Figure for the dimension price, where an interval
($X…$Y] denotes the range from $X (exclusive) to $Y (inclusive).
iii) Specification of a set of attributes, but not of their partial ordering

iv) Specification of only a partial set of attributes


Unit-1 Completed
THANK YOU
