Data Processing - Unit 3
The quality of data is measured along the following dimensions:
1. Completeness
2. Consistency
3. Accuracy
4. Validity
5. Timeliness
1. Data Completeness: all required data is present, with no missing values or records.
2. Data Consistency: the data does not contradict itself across files, systems, or records.
3. Data Accuracy: the recorded values correctly reflect the real-world facts they describe.
4. Data Validity: the values conform to the defined formats, types, and ranges.
5. Data Timeliness: the data is up to date and available when it is needed.
▪ Data Cleaning
o Missing Values
o Noisy Data: noise is a random error or variance in a measured variable; it can be smoothed using the following techniques.
1. Binning: binning methods smooth a sorted data value by consulting its "neighborhood", the values around it; the sorted values are distributed into a number of bins, as in the example below.
➢ For example, consider the sorted data: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34. Partition the values into three equal-frequency bins of four values each:
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34
Methods of binning
1. Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.
For Bin 1:
(8 + 9 + 15 + 16) / 4 = 12
(dividing by 4, the number of values in the bin: 8, 9, 15, 16)
Bin 1 = 12, 12, 12, 12
For Bin 2:
(21 + 21 + 24 + 26) / 4 = 23
Bin 2 = 23, 23, 23, 23
For Bin 3:
(27 + 30 + 30 + 34) / 4 = 30.25 ≈ 30
Bin 3 = 30, 30, 30, 30
2. Smoothing by bin boundaries: the minimum and maximum values in a bin are the bin boundaries; each value is replaced by the nearer boundary value.
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34
Answer:
Bin 1: 8, 8, 16, 16
Bin 2: 21, 21, 26, 26
Bin 3: 27, 27, 27, 34
3. Smoothing by bin medians: each value in a bin is replaced by the median of the bin.
For Bin 1: (9 + 15) / 2 = 12; for Bin 2: (21 + 24) / 2 = 22.5 ≈ 23; for Bin 3: (30 + 30) / 2 = 30.
Bin 1: 12, 12, 12, 12
Bin 2: 23, 23, 23, 23
Bin 3: 30, 30, 30, 30
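To tie the three methods together, here is a minimal Python sketch of the binning example above. The function names are my own, not from any library, and values are rounded half-up to match the hand computations.

def round_half_up(x):
    # Round .5 upward, matching the hand computations above (e.g. 22.5 -> 23).
    return int(x + 0.5)

def equal_frequency_bins(values, bin_size):
    # Sort the data and partition it into equal-frequency (equal-depth) bins.
    values = sorted(values)
    return [values[i:i + bin_size] for i in range(0, len(values), bin_size)]

def smooth_by_means(bins):
    # Replace every value in a bin with the (rounded) bin mean.
    return [[round_half_up(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_medians(bins):
    # Replace every value in a bin with the bin median.
    out = []
    for b in bins:
        mid = len(b) // 2
        median = b[mid] if len(b) % 2 else round_half_up((b[mid - 1] + b[mid]) / 2)
        out.append([median] * len(b))
    return out

def smooth_by_boundaries(bins):
    # Replace every value with the nearer of the bin's min/max boundary.
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

data = [8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34]
bins = equal_frequency_bins(data, 4)
print(smooth_by_means(bins))       # [[12, 12, 12, 12], [23, 23, 23, 23], [30, 30, 30, 30]]
print(smooth_by_boundaries(bins))  # [[8, 8, 16, 16], [21, 21, 26, 26], [27, 27, 27, 34]]
print(smooth_by_medians(bins))     # [[12, 12, 12, 12], [23, 23, 23, 23], [30, 30, 30, 30]]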
2. Regression: data can also be smoothed by fitting it to a regression function. Common types of regression include:
1. Linear Regression
2. Logistic Regression
3. Lasso Regression
4. Ridge Regression
5. Polynomial Regression
Applications of Regression
➢ Regression is a very popular technique with wide applications in business and industry. A regression procedure models the relationship between one or more predictor variables and a response variable, and the fitted function can then be used for prediction.
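As a concrete illustration of the first type, here is a minimal sketch of simple linear regression using NumPy's polyfit; the data points are invented purely for illustration.

import numpy as np

# Predictor (x) and response (y) values, invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Fit y = b0 + b1*x by least squares (a degree-1 polynomial fit).
b1, b0 = np.polyfit(x, y, 1)
print(f"y = {b0:.2f} + {b1:.2f}x")   # y = 0.15 + 1.95x

# Use the fitted line to predict the response at a new x value.
print(b0 + b1 * 6.0)                 # about 11.85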
3. Outlier analysis: outliers may be detected by clustering, where similar values are organized into groups, or "clusters"; values that fall outside of the clusters may be considered outliers.
➢ An outlier is a data value that deviates significantly from the other values in the data set.
➢ Example: in the monthly salary values 25, 27, 30, 32, 31, 29, 980 (in thousands), the value 980 lies far from the rest and is treated as an outlier.
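Besides clustering, a simple rule-of-thumb detector is the 1.5 x IQR rule; the sketch below (with the invented salary data from the example) flags values lying far outside the quartiles. This is an illustrative alternative, not the clustering method described above.

import numpy as np

values = np.array([25, 27, 30, 32, 31, 29, 980])   # 980 is the obvious outlier
q1, q3 = np.percentile(values, [25, 75])           # first and third quartiles
iqr = q3 - q1                                      # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # acceptance interval
outliers = values[(values < lower) | (values > upper)]
print(outliers)                                    # -> [980]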
▪ Data Reduction
➢ Data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume, yet
closely maintains the integrity of the original data. That is, mining
on the reduced data set should be more efficient yet produce the
same (or almost the same) analytical results.
Data reduction strategies include:
1. Dimensionality reduction
2. Numerosity reduction
3. Data compression
Histograms
➢ A histogram partitions the values of an attribute into disjoint ranges, called buckets, and stores only a count of values per bucket rather than the raw data.
➢ Example:
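The original example data is not preserved here, so the sketch below uses invented prices; it shows how a histogram reduces many values to a few bucket edges and counts.

import numpy as np

# An attribute's values (invented prices for illustration).
prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 12, 14, 14, 14,
                   15, 15, 18, 18, 18, 20, 20, 21, 21, 25, 25, 28, 30, 30])

# Summarize the 28 values with three equal-width buckets: the reduced
# representation stores only 4 bucket edges and 3 counts.
counts, edges = np.histogram(prices, bins=3)
print(edges)    # bucket boundaries over [1, 30]
print(counts)   # number of values falling in each bucket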
➢ Types of Sampling
1. Simple random sample without replacement (SRSWOR)
2. Simple random sample with replacement (SRSWR)
3. Cluster sample
4. Stratified sample
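A minimal sketch of the four sampling methods using Python's random module; the data set, cluster split, and strata are invented for illustration.

import random

data = list(range(1, 101))   # 100 records, invented for illustration

# 1. Simple random sample without replacement (SRSWOR).
srswor = random.sample(data, 10)

# 2. Simple random sample with replacement (SRSWR): a record may repeat.
srswr = [random.choice(data) for _ in range(10)]

# 3. Cluster sample: partition the data into clusters, then draw whole clusters.
clusters = [data[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [v for c in random.sample(clusters, 2) for v in c]

# 4. Stratified sample: split into strata, then sample from every stratum.
strata = {"low": data[:50], "high": data[50:]}
stratified = [v for group in strata.values() for v in random.sample(group, 5)]

print(len(srswor), len(srswr), len(cluster_sample), len(stratified))  # 10 10 20 10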
➢ Lift measures how much more often X and Y occur together than expected if they were statistically independent:
lift(X -> Y) = supp(X ∪ Y) / (supp(X) * supp(Y))
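A small worked sketch of computing support and lift over a toy transaction list; the transactions and items are invented for illustration.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def supp(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"milk"}
lift = supp(X | Y) / (supp(X) * supp(Y))
print(lift)   # 0.5 / (0.75 * 0.75) ≈ 0.89: slightly less often than if independent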
➢ Applications of association rule mining include:
• Cross-marketing
• Catalog design
1. Join Step: This step generates a (K+1)-itemset candidate from K-itemsets by joining each item with itself.
2. Prune Step: This step scans the support count of each candidate itemset in the database. If a candidate itemset does not meet the minimum support, it is regarded as infrequent and is removed. This step is performed to reduce the size of the candidate itemsets.
Steps in Apriori
➢ The Apriori data mining technique follows the join and prune steps
iteratively until the most frequent itemsets are found. A minimum
support threshold is either given in the problem or assumed by the
user.
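As an illustrative sketch of this iterative join-and-prune loop (function and variable names are my own, not from a library), the following Python code finds all frequent itemsets; it is run here on the TABLE-1 database of the worked example that follows, with min_sup = 3.

from itertools import combinations

def apriori(transactions, min_sup):
    # Return all frequent itemsets (as frozensets) with their support counts.
    # K=1: count single items and prune those below min_sup.
    items = {i for t in transactions for i in t}
    counts = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    L = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(L)
    k = 2
    while L:
        # Join step: union pairs of (k-1)-itemsets that differ by one item.
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step, part 1: drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Prune step, part 2: count remaining candidates, keep those >= min_sup.
        L = {}
        for c in candidates:
            count = sum(c <= t for t in transactions)
            if count >= min_sup:
                L[c] = count
        frequent.update(L)
        k += 1
    return frequent

# The TABLE-1 database from the worked example below, with min_sup = 3.
db = [frozenset(t) for t in (
    {"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
    {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"},
)]
for itemset, count in sorted(apriori(db, 3).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)   # ends with ['I1', 'I2', 'I3'] 3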
Example
➢ Consider the transaction database shown in TABLE-1 below. Assume a minimum support count of 3 (a support threshold of 50% of the 6 transactions).
TABLE-1
Transaction Items
T1 I1, I2, I3
T2 I2, I3, I4
T3 I4, I5
T4 I1, I2, I4
T5 I1, I2, I3, I5
T6 I1, I2, I3, I4
Step-1: K = 1
1) Create the candidate set C1 by counting the occurrences (support count) of each item in TABLE-1.
TABLE-2 (C1)
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2
2) Prune Step: I5 has a support count of 2, which does not meet min_sup = 3, so it is removed. The remaining items form the frequent 1-itemset L1.
TABLE-3 (L1)
Item Count
I1 4
I2 5
I3 4
I4 4
Step-2: K = 2
3) Join Step: Form 2-itemsets by joining L1 with itself. From TABLE-1, find the occurrences of each 2-itemset.
TABLE-4 (C2)
Item Count
I1,I2 4
I1,I3 3
I1,I4 2
I2,I3 4
I2,I4 3
I3,I4 2
4) Prune Step: TABLE-4 shows that the itemsets {I1, I4} and {I3, I4} do not meet min_sup, so they are deleted.
TABLE-5 (L2)
Item Count
I1,I2 4
I1,I3 3
I2,I3 4
I2,I4 3
Step-3: K = 3
5) Join and Prune Step: Form 3-itemsets by generating the candidate set C3 from L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets have (K-2) elements in common, so for L2 the first element should match. Then, from TABLE-5, check that every 2-itemset subset of a candidate meets min_sup (prune step), and from TABLE-1 count the occurrences of the surviving candidates.
For itemset {I1, I2, I3}, the subsets {I1, I2}, {I1, I3}, {I2, I3} all occur in TABLE-5, thus {I1, I2, I3} is frequent.
For itemset {I1, I2, I4}, the subset {I1, I4} is not frequent, as it does not occur in TABLE-5; thus {I1, I2, I4} is not frequent and is deleted. Similarly, {I1, I3, I4} and {I2, I3, I4} contain the infrequent subsets {I1, I4} and {I3, I4}, and are deleted.
TABLE-6 (C3)
Item Count
I1,I2,I3 3
I1,I2,I4 (pruned)
I1,I3,I4 (pruned)
I2,I3,I4 (pruned)
Only {I1, I2, I3} meets min_sup, so L3 = {I1, I2, I3}.
Step-4: Generate Association Rules
Confidence
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
➢ So here, by taking the frequent itemset {I1, I2, I3} (from L3) as an example, we show the rule generation. With Support_count(I1, I2, I3) = 3, the candidate rules and their confidences are:
1) {I1, I2} -> I3: confidence = 3/4 = 75%
2) {I1, I3} -> I2: confidence = 3/3 = 100%
3) {I2, I3} -> I1: confidence = 3/4 = 75%
4) I1 -> {I2, I3}: confidence = 3/4 = 75%
5) I2 -> {I1, I3}: confidence = 3/5 = 60%
6) I3 -> {I1, I2}: confidence = 3/4 = 75%
3) Exercise: Find the frequent itemsets and generate the association rules for a given transaction database. Assume a minimum support threshold of s = 33.33% and a minimum confidence threshold of c = 60%.