Data Preprocessing (Unit 3)

Data Preprocessing: An Overview

▪ Data Quality: Why Preprocess the Data?

➢ Data quality is a measure of the condition of data based on


factors such as accuracy, completeness, consistency, reliability
and whether it's up to date. Measuring data quality levels can
help organizations identify data errors that need to be resolved
and assess whether the data in their IT systems is fit to serve its
intended purpose.

➢ Five Factors of High Quality Data

1. Completeness
2. Consistency
3. Accuracy
4. Validity
5. Timeliness

1. Data Completeness

➢ Data completeness refers to whether there are any gaps between the data that was expected to be collected and the data that was actually collected.

➢ Example: An inspection is done on a vehicle and the inspector


accidentally does not indicate the current hour meter reading on
the vehicle, which is a required field for that inspection. This
has rendered the inspection incomplete and less valuable
because important information is left out.



2. Data Consistency

➢ Data consistency is the measure that indicates whether data conflicts with itself or with business rules. If data is replicated in multiple places, it needs to be consistent across all instances.

➢ Example: For a department store, you might hold data on a particular customer through a loyalty program, a mailing list, an online account payment system, and an order fulfillment system. In that tangled mess of systems there may be misspelled names, old addresses, and conflicting status flags, which can cause problems in the processes that read the data.

3. Data Accuracy

➢ Accuracy is the measure that indicates how correctly data is represented in the database when compared with its real-world value or with reference data. Accuracy is a crucial data quality characteristic because inaccurate information can cause significant problems with severe consequences.

➢ Example: In the same inspection example, if the operator records the mileage as 40,000 miles instead of 60,000 miles, this is inaccurate data, resulting in misinformation and related issues.

4. Data Validity

➢ Invalid data often points to an issue with a process rather than with a single result. Validity of data is determined by whether the data measures what it is intended to measure.



➢ Example: When new information is needed but forms don’t get
changed, the data is no longer valid because it does not properly
measure what it is supposed to.

5. Data Timeliness

➢ Data timeliness refers to the expectation of when data should be


received in order for the information to be used effectively.

➢ Example: At the end of the month, several sales representatives fail to file their sales records on time. There are also several corrections and adjustments that flow in after the end of the month. As a result, the data stored in the database is incomplete for a time after each month.

▪ Data Cleaning

➢ Data cleansing or data cleaning is the process of identifying and


removing (or correcting) inaccurate records from a dataset,
table, or database.

➢ Data cleaning (or data cleansing) routines attempt to fill in


missing values, smooth out noise while identifying outliers, and
correct inconsistencies in the data.

➢ After cleaning, a dataset should be uniform with other related


datasets in the operation.

o Missing Values

➢ Missing data is a deceptively tricky issue in machine learning. We cannot just ignore or remove missing observations; they must be handled carefully, because they can be an indication of something important.



➢ Following are some methods

1. Ignore the tuple:

➢ This is usually done when the class label is missing (assuming


the mining task involves classification). This method is not very
effective, unless the tuple contains several attributes with
missing values.

➢ By ignoring the tuple, we do not make use of the remaining


attributes’ values in the tuple. Such data could have been useful
to the task at hand.

2. Fill in the missing value manually:


➢ In general, this approach is time consuming and may not be
feasible given a large data set with many missing values.

3. Use a global constant to fill in the missing value:

➢ Replace all missing attribute values by the same constant such


as a label like “Unknown”. If missing values are replaced by,
say, “Unknown,” then the mining program may mistakenly
think that they form an interesting concept, since they all have a
value in common—that of “Unknown.” Hence, although this
method is simple, it is not foolproof.

4. Use a measure of central tendency for the attribute (e.g., the


mean or median) to fill in the missing value:
➢ Measures of central tendency indicate the “middle” value of a data distribution. For normal (symmetric) data distributions, the mean can be used, while skewed data distributions should employ the median.



➢ For example, suppose that the data distribution regarding the
income of AllElectronics customers is symmetric and that the
mean income is $56,000. Use this value to replace the missing
value for income.

5. Use the attribute mean or median for all samples belonging


to the same class as the given tuple:

➢ For example, if classifying customers according to credit risk,


we may replace the missing value with the mean income value
for customers in the same credit risk category as that of the
given tuple. If the data distribution for a given class is skewed,
the median value is a better choice.

6. Use the most probable value to fill in the missing value:


➢ This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.

➢ Methods 3 through 6 bias the data—the filled-in value may not


be correct. Method 6, however, is a popular strategy. In
comparison to the other methods, it uses the most information
from the present data to predict missing values.
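As a rough sketch of methods 3 to 5, the pandas snippet below fills missing values with a global constant, the overall mean, and the per-class mean; the column names (income, credit_risk, occupation) and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical customer data with missing entries (illustrative only).
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [56000, None, 42000, None, 61000],
    "occupation": ["clerk", None, "driver", "teacher", None],
})

# Method 3: fill a categorical attribute with a global constant.
df["occupation"] = df["occupation"].fillna("Unknown")

# Method 4: fill a numeric attribute with the overall mean (use the median for skewed data).
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# Method 5: fill with the mean income of tuples in the same class (here, credit_risk).
df["income_class_filled"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)

print(df)
```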



o Noisy Data

➢ Noise is a random error or variance in a measured variable. Noisy data may be due to faulty data collection instruments, data entry problems, and technology limitations.

➢ How to Handle Noisy Data? Following are data smoothing


techniques.

1. Binning

➢ Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it. The sorted values are distributed into a number of “buckets,” or bins.

➢ For example, the sorted data for price (in dollars) 4, 8, 15, 21, 21, 24, 25, 28, 34 can be partitioned into equal-frequency bins: Bin 1: 4, 8, 15; Bin 2: 21, 21, 24; Bin 3: 25, 28, 34.

Methods of binning

1) smoothing by bin means

➢ In smoothing by bin means, each value in a bin is replaced by


the mean value of the bin.



➢ The data for price are first sorted and then partitioned into
equal-frequency bins of size 3 (i.e., each bin contains three
values). For example, the mean of the values 4, 8, and 15 in Bin
1 is 9. Therefore, each original value in this bin is replaced by
the value 9.

2) Smoothing by bin boundaries

➢ In smoothing by bin boundaries, the minimum and maximum


values in a given bin are identified as the bin boundaries. Each
bin value is then replaced by the closest boundary value.

➢ Similarly, smoothing by bin medians can be employed, in which


each bin value is replaced by the bin median. In general, the
larger the width, the greater the effect of the smoothing.



3) Smoothing by bin medians

Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28

Advantages (Pros) of data smoothing

➢ Data smoothing makes important hidden patterns in the data set easier to understand.
➢ Data smoothing can be used to help predict trends. Prediction is very helpful for making the right decisions at the right time.
➢ Data smoothing helps in getting accurate results from the data.

Cons of data smoothing

➢ Data smoothing doesn’t always provide a clear explanation of the patterns in the data.
➢ Certain data points may be ignored because the focus is on other data points.

Worked example: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34

Sorted: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34

Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34

Smoothing by bin means

For Bin 1:
(8 + 9 + 15 + 16) / 4 = 12
(4 is the number of values in the bin: 8, 9, 15, 16)
Bin 1 = 12, 12, 12, 12

For Bin 2:
(21 + 21 + 24 + 26) / 4 = 23
Bin 2 = 23, 23, 23, 23

For Bin 3:
(27 + 30 + 30 + 34) / 4 = 30.25 ≈ 30
Bin 3 = 30, 30, 30, 30

Smoothing by bin boundaries

Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34

Answer
Bin 1: 8, 8, 16, 16
Bin 2: 21, 21, 26, 26
Bin 3: 27, 27, 27, 34

Smoothing by bin medians

Median of Bin 1 = (9 + 15) / 2 = 12
Median of Bin 2 = (21 + 24) / 2 = 22.5
Median of Bin 3 = (30 + 30) / 2 = 30

Bin 1: 12, 12, 12, 12
Bin 2: 22.5, 22.5, 22.5, 22.5
Bin 3: 30, 30, 30, 30
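The same worked example can be checked with a short Python sketch; the equal-frequency split into bins of four values follows the example above, and rounding the bin means to whole numbers is an assumption (ties in the boundary rule are broken toward the lower boundary).

```python
import statistics

values = sorted([8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34])
bins = [values[i:i + 4] for i in range(0, len(values), 4)]   # equal-frequency bins of 4 values

for b in bins:
    mean_sm = [round(statistics.mean(b))] * len(b)                # smoothing by bin means (rounded)
    median_sm = [statistics.median(b)] * len(b)                   # smoothing by bin medians
    lo, hi = b[0], b[-1]                                          # bin boundaries (min and max)
    boundary_sm = [lo if v - lo <= hi - v else hi for v in b]     # smoothing by bin boundaries
    print(b, mean_sm, median_sm, boundary_sm)
```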

2. Regression:

➢ Data smoothing can also be done by regression, a technique that


conforms data values to a function. Linear regression involves
finding the “best” line to fit two attributes (or variables) so that
one attribute can be used to predict the other.
➢ Regression refers to a type of supervised machine learning technique that is used to predict any continuous-valued attribute. Regression helps a business organization analyze the relationship between the target variable and the predictor variables. It is a very significant tool for analyzing data and can be used for financial forecasting and time series modeling.

➢ Regression involves fitting a straight line or a curve to numerous data points, in such a way that the distance between the data points and the curve is as small as possible.

➢ The most popular types of regression are linear and logistic


regressions. Other than that, many other types of regression can
be performed depending on their performance on an individual
data set.

➢ Regression is divided into five different types

1. Linear Regression
2. Logistic Regression
3. Lasso Regression
4. Ridge Regression
5. Polynomial Regression
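A minimal sketch of smoothing by linear regression with NumPy; the x and y values are made-up data, and np.polyfit is used here simply to find the least-squares “best” line described above.

```python
import numpy as np

# Made-up pairs of two attributes; y is noisy around a linear trend.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9, 14.2, 15.8])

slope, intercept = np.polyfit(x, y, deg=1)   # fit the best straight line y = slope*x + intercept
y_smoothed = slope * x + intercept           # replace noisy values with values on the fitted line

print(slope, intercept)
print(y_smoothed)
```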

Application of Regression
➢ Regression is a very popular technique, and it has wide applications in businesses and industries. The regression procedure involves a predictor variable and a response variable. The major applications of regression are given below.



o Environmental modeling
o Analyzing Business and marketing behavior
o Financial predictors or forecasting
o Analyzing the new trends and patterns.

3. Outlier analysis:

Outlier

➢ An outlier is an object that deviates significantly from the rest of


the objects. They can be caused by measurement or execution
errors. The analysis of outlier data is referred to as outlier
analysis or outlier mining.

➢ An outlier cannot simply be treated as noise or error. Instead, outliers are suspected of not being generated by the same mechanism as the rest of the data objects.

➢ Outliers are of three types, namely –

1. Global (or Point) Outliers (a data point strongly deviates


from all the rest of the data points, it is known as a global
outlier)

2. Collective Outliers(some of the data points, as a whole,


deviate significantly from the rest of the dataset)

3. Contextual (or Conditional) Outliers (A data point may be


an outlier due to a certain condition and may show
normal behavior under another condition.)



Outlier analysis

➢ The process of identifying the behavior of the outliers in a dataset is called outlier analysis. It is also known as "outlier mining," and it is a significant task of data mining.

➢ Outlier analysis is used in many applications, such as fraud detection and medical analysis, because events that occur rarely can carry much more significant information than events that occur regularly.

➢ Other applications where outlier detection plays a vital role are


given below.

o Fraud detection in the telecom industry


o In market analysis, outlier analysis enables marketers to
identify the customer's behaviors.
o In the Medical analysis field.
o Fraud detection in banking and finance such as credit cards,
insurance sector, etc.

➢ Outliers may be detected by clustering, for example, where


similar values are organized into groups, or “clusters.”
Intuitively, values that fall outside of the set of clusters may be
considered outliers

Example

➢ A 2-D customer data plot with respect to customer locations in a


city, showing three data clusters. Outliers may be detected as
values that fall outside of the cluster sets.
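A clustering-based sketch in the spirit of the 2-D customer-location example; scikit-learn's DBSCAN labels points that do not fall in any dense cluster as -1, and those points can be treated as outliers. The generated data and the eps/min_samples settings are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up 2-D customer locations: three tight groups plus two stray points.
rng = np.random.default_rng(0)
clusters = [rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in ([0, 0], [5, 5], [0, 5])]
strays = np.array([[10.0, 10.0], [-6.0, 8.0]])
X = np.vstack(clusters + [strays])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(X[labels == -1])   # points outside every cluster, flagged as outliers
```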



o Data Cleaning as a Process

➢ The first step in data cleaning as a process is discrepancy


detection. Discrepancies can be caused by several factors,
including poorly designed data entry forms that have many
optional fields, human error in data entry, deliberate errors (e.g.,
respondents not wanting to divulge information about
themselves), and data decay (e.g., outdated addresses).

➢ Discrepancies may also arise from inconsistent data


representations and inconsistent use of codes.

➢ The second step is data transformation. That is, once we find discrepancies, we typically need to define and apply (a series of) transformations to correct them.

➢ Commercial tools can assist in the data transformation step. Data migration tools allow simple transformations to be specified, such as replacing the string “gender” by “sex.” ETL (extraction/transformation/loading) tools allow users to specify transforms through a graphical user interface (GUI).

▪ Data Reduction
➢ Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume. Mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.

o Overview of Data Reduction Strategies

➢ Data reduction strategies include dimensionality reduction,


numerosity reduction, and data compression.
1. Dimensionality reduction

➢ The number of input features, variables, or columns present


in a given dataset is known as dimensionality, and the
process to reduce these features is called dimensionality
reduction.

➢ In many cases a dataset contains a huge number of input features, which makes the predictive modeling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.

➢ The dimensionality reduction technique can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information." These techniques are widely used in machine learning for obtaining a better-fitting predictive model while solving classification and regression problems.

➢ It is commonly used in the fields that deal with high-


dimensional data, such as speech recognition, signal
processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.

Benefits of applying Dimensionality Reduction

Some benefits of applying dimensionality reduction technique to


the given dataset are given below:

o By reducing the dimensions of the features, the space required to store the dataset also gets reduced.
o Less computation and training time is required with fewer feature dimensions.
o Reduced feature dimensions help in visualizing the data quickly.
o It removes redundant features (if present) by taking care of multicollinearity.

Disadvantages of dimensionality Reduction

o There are also some disadvantages of applying the


dimensionality reduction, which are given below:



o Some data may be lost due to dimensionality reduction.
o In the PCA (Principal Component Analysis) dimensionality reduction technique, the number of principal components to retain is sometimes not known.
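A short scikit-learn sketch of PCA as one common dimensionality reduction technique; the synthetic 50-feature data and the choice to keep 95% of the variance are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up high-dimensional data: 200 samples, 50 correlated features.
rng = np.random.default_rng(42)
base = rng.normal(size=(200, 5))
X = base @ rng.normal(size=(5, 50)) + 0.01 * rng.normal(size=(200, 50))

pca = PCA(n_components=0.95)     # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)          # e.g. (200, 50) -> (200, 5)
print(pca.explained_variance_ratio_.sum())
```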

2. Numerosity reduction

➢ This technique replaces the original data volume by


alternative, smaller forms of data representation.

➢ These techniques may be parametric or nonparametric. For


parametric methods, a model is used to estimate the data, so
that typically only the data parameters need to be stored, instead
of the actual data. (Outliers may also be stored.)
Example- Regression and log-linear models.

➢ Nonparametric methods for storing reduced representations of


the data include histograms, clustering, sampling, and data cube
aggregation.

Difference between Dimensionality Reduction and Numerosity Reduction

Dimensionality reduction:
o Data encoding or transformations are applied to obtain a reduced or compressed representation of the original data.
o It can be used for removing irrelevant and redundant attributes.
o Some data can be lost, which may be inappropriate.

Numerosity reduction:
o The data volume is reduced by choosing alternative, smaller forms of data representation.
o It is merely a technique for representing the original data in a smaller form.
o There is no loss of data; the whole data is represented in a smaller form.

3. Data compression

➢ Transformations are applied so as to obtain a reduced or


“compressed” representation of the original data. If the original
data can be reconstructed from the compressed data
without any information loss, the data reduction is called
“lossless”. If, instead, we can reconstruct only an
approximation of the original data, then the data reduction
is called “lossy”.

Histograms

➢ Histograms use binning to approximate data distributions and


are a popular form of data reduction.

➢ A histogram provides a visual interpretation of numerical data


by showing the number of data points that fall within a specified
range of values (called “bins” or “bucket”)

➢ If each bucket represents only a single attribute–


value/frequency pair, the buckets are called “singleton
buckets”.

➢ Example



The following data are a list of AllElectronics prices for commonly
sold items (rounded to the nearest dollar). The numbers have been
sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20,
20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30,30, 30.

The following diagram shows a histogram for the data using singleton buckets. To further reduce the data, it is common to have each bucket denote a continuous value range for the given attribute; in the reduced histogram, each bucket represents a different $10 range for price.
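The $10-wide equal-width buckets can be sketched with NumPy on the price list above; the exact bin edges used below are an assumption consistent with the 1 to 30 price range.

```python
import numpy as np

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
          15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
          20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

counts, edges = np.histogram(prices, bins=[1, 11, 21, 31])  # three $10-wide buckets
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{int(lo)}-{int(hi) - 1}: {c} items")
```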



Sampling

➢ Sampling can be used as a data reduction technique because it


allows a large data set to be represented by a much smaller
random data sample (or subset).

➢ Types of Sampling

1. Simple random sample without replacement (SRSWOR) of size n

➢ This is a method of selecting n units out of the N units one by one such that, at any stage of the selection, any one of the remaining units has the same chance of being selected. As each item is selected, it is removed from the population.

2. Simple random sample with replacement (SRSWR) of size n

➢ This is a method of selecting n units out of the N units one by one such that, at each stage of the selection, each unit has an equal chance of being selected, i.e., 1/N.

➢ Example

➢ Suppose we have a bowl of 100 unique numbers from 0 to 99. We want to select a random sample of numbers from the bowl. After we pick a number from the bowl, we can put the number aside or we can put it back into the bowl. If we put the number back in the bowl, it may be selected more than once; if we put it aside, it can be selected only once.

➢ When a population element can be selected more than one time,


we are sampling with replacement. When a population
element can be selected only one time, we are sampling
without replacement.
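A quick NumPy sketch of the bowl example; rng.choice with replace=False gives SRSWOR and replace=True gives SRSWR, and the sample size of 10 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
population = np.arange(100)          # the bowl of 100 unique numbers from 0 to 99

srswor = rng.choice(population, size=10, replace=False)  # without replacement: no repeats possible
srswr = rng.choice(population, size=10, replace=True)    # with replacement: repeats possible

print(sorted(srswor))
print(sorted(srswr))
```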

3. Cluster sample



➢ Cluster sampling refers to a type of sampling method. With
cluster sampling, the researcher divides the population into
separate groups, called clusters. Then, a simple random sample
of clusters is selected from the population. The researcher
conducts his analysis on data from the sampled clusters.

➢ For example, consider a scenario where an organization wants to survey the performance of smartphones across Germany. It can divide the entire country’s population into cities (clusters), select the cities with the highest population, and further filter those using mobile devices. This multi-stage sampling is known as cluster sampling.

4. Stratified sample:

➢ Stratified sampling is a type of sampling method in which the


total population is divided into smaller groups or strata to
complete the sampling process.

➢ The strata are formed based on some common characteristics in


the population data. After dividing the population into strata, the
researcher randomly selects the sample proportionally.

➢ For example, a stratified sample may be obtained from


customer data, where a stratum is created for each customer age
group. In this way, the age group having the smallest number of
customers will be sure to be represented.
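A pandas sketch of the customer age-group example; the DataFrame, the age_group proportions, and the 10% sampling fraction are made-up assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
customers = pd.DataFrame({
    "customer_id": np.arange(1000),
    "age_group": rng.choice(["18-25", "26-40", "41-60", "60+"],
                            size=1000, p=[0.4, 0.3, 0.2, 0.1]),
})

# Stratified sample: draw 10% from every age group, so small groups are still represented.
stratified = customers.groupby("age_group").sample(frac=0.1, random_state=1)

print(stratified["age_group"].value_counts())
```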
▪ Difference between cluster and stratified sampling

o Meaning: Stratified sampling is one in which the population is divided into homogeneous segments (strata), and the sample is then randomly taken from each segment. Cluster sampling refers to a sampling method wherein members of the population are selected at random from naturally occurring groups called 'clusters'.
o Sample: In stratified sampling, randomly selected individuals are taken from all the strata. In cluster sampling, all the individuals are taken from randomly selected clusters.
o Selection of population elements: Individually (stratified) versus collectively (cluster).
o Homogeneity: Within the group (stratified) versus between groups (cluster).
o Heterogeneity: Between groups (stratified) versus within the group (cluster).
o Bifurcation: Imposed by the researcher (stratified) versus naturally occurring groups (cluster).
o Objective: To increase precision and representation (stratified) versus to reduce cost and improve efficiency (cluster).



Data Cube Aggregation

➢ Data cubes store multidimensional, aggregated information. For example, sales data can be aggregated so that the cube stores totals per month or per quarter rather than individual transactions, and mining is then performed on the much smaller aggregated cube.

▪ Association Rule Mining



➢ Association rule mining finds interesting associations and
relationships among large sets of data items. This rule shows
how frequently an item set occurs in a transaction.

➢ Association rules are if-then statements that help to show the


probability of relationships between data items within large data
sets in various types of databases. Association rule mining has a
number of applications and is widely used to help discover sales
correlations in transactional data or in medical data sets.

➢ Association rules are created by searching data for frequent if-


then patterns and using the criteria

o Support (Support indicates how frequently the if/then


relationship appears in the database.)
o Confidence (Confidence tells about the number of times
these relationships have been found to be true.)

➢ Example: Support and Confidence can be represented by the


following example

Bread => Butter [support = 2%, confidence = 60%]

➢ The above statement is an example of an association rule. It means that 2% of all transactions contain both bread and butter, and that 60% of the customers who bought bread also bought butter.

➢ Support and Confidence for itemsets A and B are represented by the formulas:

Support(A => B) = Support_count(A ∪ B) / Total number of transactions
Confidence(A => B) = Support_count(A ∪ B) / Support_count(A)

➢ To identify the most important relationships, a third metric, called “lift,” can be used to compare confidence with expected confidence.

➢ Association rules are calculated from itemsets, which are made up of two or more items. Rules are built by analyzing all the possible itemsets.
➢ Lift
The lift of a rule is defined as:

lift(X => Y) = supp(X ∪ Y) / (supp(X) * supp(Y))
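Support, confidence, and lift can be computed directly from a list of transactions; the tiny transaction list below is made up purely to show the arithmetic of the three formulas.

```python
# Made-up transactions, only to illustrate the formulas.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset):
    # fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    return support(lhs | rhs) / (support(lhs) * support(rhs))

X, Y = {"bread"}, {"butter"}
print(support(X | Y), confidence(X, Y), lift(X, Y))
```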

➢ Association rule mining consists of 2 steps:

1. Find all the frequent itemsets.


2. Generate association rules from the above frequent
itemsets.

➢ The main applications of association rule mining:

• Basket data analysis (Market Basket analysis)

➢ It is to analyze the association of purchased items in a single


basket or single purchase.



➢ It is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between the items that people frequently buy together.

➢ Example: Data is collected using barcode scanners in most


supermarkets. This database, known as the “market basket”
database, consists of a large number of records on past
transactions. A single record lists all the items bought by a
customer in one sale.

• Cross marketing

➢ It is to work with other businesses that complement your own,


not competitors. For example, vehicle dealerships and
manufacturers have cross marketing campaigns with oil and gas
companies for obvious reasons.

• Catalog design

➢ The selection of items in a business’s catalog is often designed so that the items complement each other, so that buying one item leads to buying another. These items are therefore often complements or closely related.

▪ Apriori Algorithm – Frequent Pattern Algorithms

➢ The Apriori algorithm was given by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. We apply an iterative approach, or level-wise search, where frequent k-itemsets are used to find (k+1)-itemsets.



The steps followed in the Apriori Algorithm of data mining are:
1. Join Step: This step generates (K+1)-itemsets from K-itemsets by joining each itemset with itself.

2. Prune Step: This step scans the count of each item in the database. If a candidate item does not meet minimum support, it is regarded as infrequent and is removed. This step is performed to reduce the size of the candidate itemsets.

Steps in Apriori

➢ Apriori algorithm is a sequence of steps to be followed to find


the most frequent item set in the given database.

➢ This data mining technique follows the join and the prune steps
iteratively until the most frequent item set is achieved. A
minimum support threshold is given in the problem or it is
assumed by the user.

Example

➢ Consider the following dataset; we will find the frequent itemsets and generate association rules for them.

Minimum support count is 2
Minimum confidence is 60%

Step-1: K=1

1) Create a table containing the support count of each item present in the dataset; this is called C1 (the candidate set).

2) Compare each candidate item’s support count with the minimum support count (here min_support = 2); if the support count of a candidate item is less than min_support, remove that item. This gives us itemset L1.

Step-2: K=2

1)



➢ Generate candidate set C2 using L1 (this is called join step).
Condition of joining Lk-1 and Lk-1 is that it should have (K-2)
elements in common.

➢ Check whether all subsets of an itemset are frequent or not, and if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, and they are frequent. Check this for each itemset.)

➢ Now find support count of these itemsets by searching in


dataset.

2) Compare each candidate (C2) support count with the minimum support count (here min_support = 2); if the support count of a candidate item is less than min_support, remove that item. This gives us itemset L2.



Step-3:

1)
➢ Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 and Lk-1 is that they should have (K-2) elements in common, so here, for L2, the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.

➢ Check if all subsets of these itemsets are frequent or not and if


not, then remove that itemset.(Here subset of {I1, I2, I3} are
{I1, I2},{I2, I3},{I1, I3} which are frequent. For {I2, I3, I4},
subset {I3, I4} is not frequent so remove it. Similarly check for
every itemset)

➢ Find support count of these remaining itemset by searching in


dataset.

2) Compare each candidate (C3) support count with the minimum support count (here min_support = 2); if the support count of a candidate item is less than min_support, remove that item. This gives us itemset L3.

Step-4:

➢ Generate candidate set C4 using L3 (join step). Condition of


joining Lk-1 and Lk-1 (K=4) is that, they should have (K-2)
elements in common. So here, for L3, first 2 elements (items)
should match.

➢ Check whether all subsets of these itemsets are frequent or not. (Here, the itemset formed by joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not frequent.) So there is no itemset in C4.

➢ We stop here because no further frequent itemsets are found.
Confidence

➢ A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.

Confidence(A->B)=Support_count(A∪B)/Support_count(A)
➢ So here, by taking an example of any frequent itemset, we will
show the rule generation.
Itemset {I1, I2, I3} //from L3



➢ So the rules can be:

[I1^I2] => [I3]  // confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4 * 100 = 50%
[I1^I3] => [I2]  // confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4 * 100 = 50%
[I2^I3] => [I1]  // confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4 * 100 = 50%
[I1] => [I2^I3]  // confidence = sup(I1^I2^I3)/sup(I1) = 2/6 * 100 = 33%
[I2] => [I1^I3]  // confidence = sup(I1^I2^I3)/sup(I2) = 2/7 * 100 = 28%
[I3] => [I1^I2]  // confidence = sup(I1^I2^I3)/sup(I3) = 2/6 * 100 = 33%

So if the minimum confidence is 50%, the first 3 rules can be considered strong association rules.
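The level-wise search and rule generation above can be sketched in plain Python; the transaction list used here is an assumption chosen to be consistent with the support counts quoted in the text (for example sup(I1) = 6, sup(I2) = 7, sup(I1^I2) = 4), and the thresholds follow the concluding sentence above (min_support = 2, min_confidence = 50%).

```python
from itertools import combinations

# Transactions assumed to match the support counts used in the worked example.
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_support, min_confidence = 2, 0.5

def support_count(itemset):
    return sum(itemset <= t for t in transactions)

# Level-wise (Apriori) search: start from 1-itemsets, then join and prune.
items = sorted({i for t in transactions for i in t})
frequent = {}                                   # frozenset -> support count
current = [frozenset([i]) for i in items]
k = 1
while current:
    level = {c: support_count(c) for c in current}
    level = {c, s for c, s in level.items()} if False else {c: s for c, s in level.items() if s >= min_support}
    frequent.update(level)
    # Join step: build (k+1)-candidates whose every k-subset is frequent (prune step).
    candidates = set()
    for a in level:
        for b in level:
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in level for sub in combinations(union, k)
            ):
                candidates.add(union)
    current = list(candidates)
    k += 1

# Rule generation: A -> B is strong if its confidence meets the threshold.
for itemset, count in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = count / frequent[lhs]
            if conf >= min_confidence:
                print(set(lhs), "->", set(itemset - lhs), f"confidence={conf:.0%}")
```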

Examples of apriori algorithm

1)



Transaction List of items

T1 I1,I2,I3

T2 I2,I3,I4

T3 I4,I5

T4 I1,I2,I4

T5 I1,I2,I3,I5

T6 I1,I2,I3,I4

Support threshold=50%, Confidence= 60%

Support threshold=50% => (50/100)*6= 3 => min_sup=3

Minimum support count =3


Confidence =60%


1. Count of Each Item
TABLE-2

Item   Count
I1     4
I2     5
I3     4
I4     4
I5     2

2. Prune Step: TABLE-2 shows that item I5 does not meet min_sup = 3, so it is deleted. The remaining items form L1.

Item   Count
I1     4
I2     5
I3     4
I4     4

3. Join Step: Form 2-itemsets. From TABLE-1, find the occurrences of each 2-itemset.
TABLE-4

Item    Count
I1,I2   4
I1,I3   3
I1,I4   2
I2,I3   4
I2,I4   3
I3,I4   2

4. Prune Step: TABLE-4 shows that itemsets {I1, I4} and {I3, I4} do not meet min_sup, so they are deleted. The remaining 2-itemsets form L2.
TABLE-5

Item    Count
I1,I2   4
I1,I3   3
I2,I3   4
I2,I4   3

5. Join and Prune Step: Form 3-itemsets. From TABLE-1, find the occurrences of each 3-itemset. From TABLE-5, find the 2-itemset subsets that satisfy min_sup.

We can see that for itemset {I1, I2, I3}, the subsets {I1, I2}, {I1, I3}, {I2, I3} all occur in TABLE-5, so {I1, I2, I3} is frequent.
We can see that for itemset {I1, I2, I4}, the subsets are {I1, I2}, {I1, I4}, {I2, I4}; {I1, I4} is not frequent, as it does not occur in TABLE-5, so {I1, I2, I4} is not frequent and is deleted.
TABLE-6

Item        Count
I1,I2,I3    3
(The other candidate 3-itemsets {I1, I2, I4}, {I1, I3, I4} and {I2, I3, I4} are pruned because they contain 2-itemset subsets that are not in TABLE-5.)



Only {I1, I2, I3} is frequent.
2)

3)

Find the frequent itemsets and generate association rules for this dataset. Assume a minimum support threshold of s = 33.33% and a minimum confidence threshold of c = 60%.

Let’s start,
