Data Mining Assignment 1 2023 Preprocessing and Frequent Pattern
Submission Location: Submit a Word document on Google Classroom. The name of the document should be your
roll number.
Question:
In this assignment, you have to pre-process the data and identify interesting patterns in the given dataset of
Graduate University Students.
1) Data Exploration
Pre-process and explore the given dataset using WEKA. Report your findings in a Word document
and upload the document to Google Classroom.
b. For each attribute, identify data quality issues such as missing values, inconsistencies, noise, and outliers.
If any of these problems exist in a specific attribute, suggest an appropriate response. For example, state
how you intend to handle missing values and outliers.
c. Analyze the attributes based on the above information. (Don't just give numerical values; also explain
in simple English what they tell you about the attribute.)
i. How is the attribute distributed (normal or skewed)?
ii. Find other insights, such as which attributes can be eliminated because they show little or no
variance (low variance filter).
d. [5 marks] Discuss the new insights you found from visualizing and exploring the data, the techniques you tested,
and the results you obtained. You can include the different graphs and plots you have used for visualization,
but do explain them in plain English.
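The per-attribute checks listed above (missing values, outliers, skewness, low variance) can also be sketched outside WEKA. The Python snippet below is only an illustration under stated assumptions: the function name profile_numeric, the sample study_hours values, and the 1.5 x IQR outlier rule are choices made for this example and do not come from the assignment dataset or from WEKA.

```python
import statistics


def profile_numeric(values):
    """Summarise one numeric attribute; None marks a missing value.

    Assumes at least two non-missing values are present.
    """
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    sd = statistics.pstdev(present)

    # Population skewness: mean of cubed z-scores (0 for a constant attribute).
    if sd > 0:
        skewness = sum(((v - mean) / sd) ** 3 for v in present) / len(present)
    else:
        skewness = 0.0

    # Flag values outside the 1.5 * IQR fences as potential outliers.
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]

    return {
        "missing": len(values) - len(present),
        "mean": mean,
        "variance": sd ** 2,
        "skewness": skewness,
        "outliers": outliers,
    }


# Hypothetical values for a "study hours per week" attribute.
study_hours = [2, 3, 3, 4, 4, 5, None, 40]
report = profile_numeric(study_hours)
print(report)

# An attribute with (almost) zero variance is a candidate for removal.
constant = profile_numeric([1, 1, 1, 1])
print(constant["variance"])
```

A clearly positive skewness suggests a long right tail, and a variance close to zero marks an attribute that a low variance filter would drop. The numbers still need a plain-English interpretation in the report.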
After data exploration, your task is to pre-process the given dataset and find trends and patterns using
association rule mining. Pre-processing includes data discretization (binning), data reduction, data smoothing,
and feature selection. Explain your choices, such as why you selected equal-frequency or equal-width binning. Also,
explain your choices for normalization and data reduction.
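To make the binning choice concrete, the following Python sketch contrasts equal-width and equal-frequency discretization on a small, made-up list of CGPA values. The function names and the data are hypothetical and chosen only for illustration; WEKA's own discretization filter may place bin boundaries differently.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts.

    Bins are assigned by rank, so tied values may end up in different bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


# Hypothetical CGPA values: four low scores and two high ones.
cgpa = [2.0, 2.1, 2.2, 2.3, 3.9, 4.0]
width_bins = equal_width_bins(cgpa, 3)
freq_bins = equal_frequency_bins(cgpa, 3)
print(width_bins)
print(freq_bins)
```

With skewed data such as this, equal-width binning leaves the middle bin empty, while equal-frequency binning spreads the records evenly but groups values that are far apart. Which behaviour is preferable depends on the attribute, and that trade-off is what the explanation should address.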
NOTE: Data pre-processing and frequent pattern mining form an iterative process. You may need to pre-process
the data multiple times to identify interesting and valuable rules that give new insights.
Experiment with different parameters to extract strong rules (e.g., rules with high lift and confidence, which at
the same time have relatively good support). Convert the dataset into a form suitable for Association Rule
Mining. Pre-process the attributes so you can see some patterns in data and extract rules using Apriori.
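As a rough picture of what "a form suitable for Association Rule Mining" means, the Python sketch below turns discretized records into sets of attribute=value items and runs a minimal, unoptimized Apriori pass to find frequent itemsets. The attribute names, the four sample rows, and the helper names are hypothetical. This is a teaching sketch of the level-wise idea, not WEKA's implementation.

```python
from itertools import combinations


def to_transactions(rows):
    """Turn each record into a set of 'attribute=value' items."""
    return [
        frozenset(f"{name}={value}" for name, value in row.items() if value is not None)
        for row in rows
    ]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(transactions)
    current = [frozenset([item]) for item in {i for t in transactions for i in t}]
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate itemset.
        level = {}
        for candidate in current:
            count = sum(1 for t in transactions if candidate <= t)
            if count / n >= min_support:
                level[candidate] = count / n
        frequent.update(level)

        k += 1
        # Join step: merge frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep candidates whose (k-1)-subsets are all frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


# Hypothetical, already discretized student records.
rows = [
    {"study_hours": "high", "job": "no", "cgpa": "high"},
    {"study_hours": "high", "job": "no", "cgpa": "high"},
    {"study_hours": "low", "job": "yes", "cgpa": "low"},
    {"study_hours": "high", "job": "yes", "cgpa": "medium"},
]
transactions = to_transactions(rows)
frequent = frequent_itemsets(transactions, min_support=0.5)
for itemset, supp in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

Lowering min_support admits more itemsets and usually many more rules, which is one reason the parameters need to be varied and the results compared.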
1. [10 points] Use confidence as an interestingness measure of an association rule. Rank the top 10
association rules for at least three different combinations of support and confidence. Explain the rules
and why you consider them interesting and valuable. Furthermore, also give recommendations based on
the discovered rules that might help the user.
2. [10 points] Use interest as an interestingness measure of an association rule. Rank the top 10 association
rules for at least three combinations of support and interest. Explain the rules and why you consider them
interesting and useful. Furthermore, also give recommendations based on the discovered rules that might
help the user.
3. [10 points] Try to formulate some questions that you want your rule mining system to answer.
Select the attributes that will be required to answer your questions. Run Association rule mining to
extract interesting patterns. Show at least 10 rules. Explain the rules and why you consider them
interesting and useful. Explain what insight you got regarding your questions.
a. For example, one may want to find the effect of the number of study hours, job, marital status,
and highly educated parents on CGPA. To figure this out, select the appropriate attributes,
pre-process them, and run Apriori. You can set the class attribute in WEKA to find rules about a
particular attribute.
Note: The top 5 most interesting rules are most likely not the top 5 in the result set of the Apriori algorithm.
They are rules that, in addition to having high support, lift, and confidence, also give some non-trivial, useful
information based on the underlying business objectives.
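For reference, the three measures named above can be computed by hand as in the Python sketch below. It assumes that "interest" in this assignment refers to lift, which is a common usage but should be checked against the course material. The transactions, rules, and function names are hypothetical, and the code assumes the antecedent and the consequent each occur in at least one transaction.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    s_a = support(transactions, antecedent)
    s_c = support(transactions, consequent)
    s_both = support(transactions, antecedent | consequent)
    confidence = s_both / s_a   # estimate of P(consequent | antecedent)
    lift = confidence / s_c     # values above 1 indicate positive association
    return {"support": s_both, "confidence": confidence, "lift": lift}


# Hypothetical transactions of attribute=value items.
transactions = [
    frozenset({"study_hours=high", "job=no", "cgpa=high"}),
    frozenset({"study_hours=high", "job=no", "cgpa=high"}),
    frozenset({"study_hours=low", "job=yes", "cgpa=low"}),
    frozenset({"study_hours=high", "job=yes", "cgpa=medium"}),
]

# Hypothetical candidate rules as (antecedent, consequent) pairs.
rules = [
    (frozenset({"job=no"}), frozenset({"cgpa=high"})),
    (frozenset({"cgpa=high"}), frozenset({"study_hours=high"})),
    (frozenset({"job=yes"}), frozenset({"study_hours=high"})),
]

scored = [(a, c, rule_metrics(transactions, a, c)) for a, c in rules]
by_lift = sorted(scored, key=lambda r: r[2]["lift"], reverse=True)
for antecedent, consequent, m in by_lift:
    print(sorted(antecedent), "->", sorted(consequent), m)
```

In this toy data the first two rules have the same confidence but different lift, because the consequent of the second rule is already very common. This is why ranking by confidence alone and ranking by lift can produce different top-10 lists.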
Submission: Do include the different graphs and plots that you have used for visualization.