Data Mining Questions Q&A
5 HOURS
INSTRUCTIONS:
There are five (5) Questions in this Exam, each worth a total of 25
Marks. Read the Questions carefully and attempt Question 1 OR Question 2, and any
other two (2) Questions. In all, you are answering THREE (3) Questions out of the
five provided. You are expected to type your answers to Question 1 OR Question 2
(depending on the one you choose) in MS Word, and save the document as a PDF with your
StudentID as the file name. Note that answers to your two selected Questions from
Questions 3 to 5 must be written in the Answer Booklet provided.
Submit your saved pdf together with all relevant files by uploading them to the
Moodle LMS via the Exam thread created.
Remember Question 1 OR 2 is mandatory to answer.
The following toolkits are required for answering Questions 1 or 2.
-WEKA
-RapidMiner
-SPSS
b) You are to call your .arff file in WEKA and use a relevant feature selection
method to select your features, bearing in mind the label or target feature shown in
Appendix A. Consider using the attribute evaluator and search method functions. Report on
your selected features as well as the feature selection algorithm used. That is, provide
the total number of features selected and their respective names. [3 marks]
c) At the preprocess tab in WEKA, select the features reported in (b) using the invert
and remove buttons. Report a data visualization of the selected features together with
the target feature using the visualize all button. [1 mark]
e) With regard to your response in (d), use any suitable classification or clustering
technique to train and validate the dataset in WEKA. Report your result with respect to
significant information. [5 marks]
f) Call the dataset in SPSS and provide descriptive statistics for all features. A single
table will do for this part. Report on relevant statistics (mean, mode, median, min,
max, range, standard deviation) based on the features. [3 marks]
g) Out of the selected features, you are to discretize all the continuous features or
variables. You can consider using the recode into different variables function in SPSS.
You are to save and send the updated version of the .sav dataset bearing the recoded
variable names. [2 marks]
h) For each of your discretized variables, provide either a bar chart or a pie chart. [2 marks]
i) Using the discretized variables, make any comparison between the target variable and
any of your discretized variables. You can consider using the cross tabulation
functionality in SPSS. Explain your result. [3 marks]
ii. Mention any four algorithms you can use per your recommended type of
classification in (i) above. [2 marks]
iii. Explain any three major tasks you will undertake during preprocessing of
the data. [3 marks]
Answer
Examiner: Dr Solomon Mensah Page 4 of 6
• Data cleaning: Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration: Integration of multiple databases, data cubes, or files
• Data transformation: Normalization and aggregation
• Data reduction: Obtains reduced representation in volume but produces the
same or similar analytical results
• Data discretization: Part of data reduction but with particular importance,
especially for numerical data
iv. Explain how you will normalize your dataset with any suitable
normalization technique. [2 marks]
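• Normalization: attribute values are scaled to fall within a small, specified range (for example 0 to 1)
– min-max normalization: v' = (v - min) / (max - min) * (new_max - new_min) + new_min
– z-score normalization: v' = (v - mean) / standard deviation
– normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1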
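The three normalization techniques listed above can be sketched in a few lines of Python. This is only an illustration: the `ages` values are hypothetical and not taken from the exam dataset, and the z-score version assumes the population standard deviation (`pstdev`); a tool such as WEKA or SPSS may use the sample standard deviation instead.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they fall in [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Centre on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

ages = [20, 30, 40, 50, 60]  # hypothetical values for a numeric attribute

print(min_max(ages))
print([round(z, 3) for z in z_score(ages)])
print(decimal_scaling(ages))
```

Min-max is a natural choice when the minimum and maximum of an attribute are known and stable; z-score is less sensitive to a single extreme value stretching the range.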
b. Explain how you will separate the dataset into the right percentages or
partitions before subjecting it to your chosen algorithm. [3 marks]
Answer: Use 70% of the data for training and 30% for testing, or 80% for training
and 20% for testing.
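A percentage split of this kind can be sketched as follows. The function name `train_test_split`, the fixed seed and the 100 placeholder rows are assumptions made for illustration; in WEKA the same effect is obtained with the "Percentage split" test option.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows, then cut it into training and testing parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # hypothetical dataset of 100 instances

train_70, test_30 = train_test_split(rows, train_fraction=0.7)
train_80, test_20 = train_test_split(rows, train_fraction=0.8)
print(len(train_70), len(test_30))
print(len(train_80), len(test_20))
```

Shuffling before cutting matters when the file is sorted by the target, since otherwise one partition could contain only one class.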
Binomial Distribution:
The binomial distribution is commonly used when conducting tests involving
binary outcomes (e.g., defective or non-defective, fake or genuine).
In this scenario, the FDA is testing a sample of products to identify defects, which
is a binary outcome (defective or non-defective).
The binomial distribution describes the number of successes (defective products)
in a fixed number of independent Bernoulli trials (testing each product).
Hypothesis Testing for Proportions:
The FDA may have formulated hypotheses about the proportion of defective
products in the entire population.
They would then collect a sample of products and test whether the proportion of
defective products in the sample differs significantly from a specified value (e.g., the
proportion of defective products from Company A).
Evidence:
The scenario mentions that the FDA found "at least two defective/fake products"
in the sample of 300 products. This suggests that the FDA was interested in
determining the proportion of defective products in the sample.
The use of a binomial test or hypothesis testing for proportions aligns with the
objective of identifying defective products in a sample through statistical inference.
In summary, based on the scenario and the objective of testing the sampled
products for defects, it is likely that the FDA used a probability model such as a
binomial test or a hypothesis test for proportions. These models are commonly
employed when dealing with binary outcomes and testing hypotheses about
proportions in a population.
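As a small numerical illustration of the binomial model described above, the probability of finding at least two defective products in a sample of 300 can be computed directly. The defect rate `p = 0.01` is purely hypothetical; the scenario does not state the true proportion.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for a binomial variable with n trials and success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def prob_at_least(k_min, n, p):
    """P(X >= k_min), computed as 1 minus the probability of fewer than k_min."""
    return 1.0 - sum(binom_pmf(k, n, p) for k in range(k_min))

n = 300   # sample size from the scenario
p = 0.01  # hypothetical proportion of defective products (assumption)

p_two_or_more = prob_at_least(2, n, p)
print(round(p_two_or_more, 4))
```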
a) Comparing the two outputs from LHS and RHS above, which model will you
recommend as optimal for classification of the given dataset. [4 marks]
Ans: Naïve Bayesian model
b) Justify your answer for the best model in (a) above with valid reasons based on the
outputs presented. [7 marks]
Ans: The performance (evaluation) metrics are higher for the Naïve Bayesian model
than for Logistic regression. For instance, the weighted average recall of the
Naïve Bayesian model is 0.643, which is greater than that of logistic regression
(0.571).
c) Explain the Confusion Matrix for your model selected in (a). [10 marks]
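For the Naïve Bayesian model, TP = 8, TN = 1, FP = 4 and FN = 1:
              Predicted yes   Predicted no
Actual yes    TP = 8          FN = 1
Actual no     FP = 4          TN = 1
Of the 9 actual yes instances, 8 are correctly classified as yes (TP) and 1 is
wrongly classified as no (FN). Of the 5 actual no instances, 1 is correctly
classified as no (TN) and 4 are wrongly classified as yes (FP).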
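The link between this confusion matrix and the weighted average recall of 0.643 quoted in (b) can be checked with a short calculation. This is a sketch that assumes "yes" is the positive class and that the weights are the class sizes, which is how a weighted average is normally formed.

```python
# Confusion matrix counts for the Naive Bayesian model, as given in the answer.
tp, fn, fp, tn = 8, 1, 4, 1

n_yes = tp + fn  # actual yes instances
n_no = fp + tn   # actual no instances

recall_yes = tp / n_yes  # share of yes instances found
recall_no = tn / n_no    # share of no instances found

# Weighted by class size (assumption about how the average is weighted).
weighted_recall = (n_yes * recall_yes + n_no * recall_no) / (n_yes + n_no)
accuracy = (tp + tn) / (n_yes + n_no)

print(round(weighted_recall, 3), round(accuracy, 3))
```

With class-size weights, the weighted average recall always equals the overall accuracy, which is why both come out the same here.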
d) Imagine the yes class has 4 instances instead of 9 and the no class has 10 instances
instead of 5, which technique can be considered to increase the success (yes)
instances while maintaining the failure (no) instances. [4 marks]
Ans: Synthetic Minority Oversampling Technique (SMOTE), or oversampling of the
minority (yes) class in general.
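The idea behind SMOTE can be sketched as below: new minority instances are created by interpolating between an existing minority instance and one of its nearest minority neighbours. This is a simplified, SMOTE-style illustration and not the reference implementation; the function name `smote_like` and the two-feature records are hypothetical.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority points
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # position along the segment from x to neighbour
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

# Hypothetical 2-feature records: 4 "yes" (minority) and 10 "no" instances.
yes = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
no = [(float(i), float(i % 3)) for i in range(5, 15)]

new_yes = smote_like(yes, n_new=len(no) - len(yes))
yes_balanced = yes + new_yes
print(len(yes_balanced), len(no))
```

The "no" instances are left untouched; only the "yes" class grows, from 4 to 10 instances.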
a) Provide a general overview of the confusion matrix in a tabular form showing the true
positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).
[4 marks]
b) Create a confusion matrix in a tabular form for the output from the classifier and the
actual dataset labels. [7 marks]
ANS:
                   Predicted positive   Predicted negative
Actual positive    TP = 3               FN = 1
Actual negative    FP = 5               TN = 2
c) From the confusion matrix in (b), compute the following performance measures by
showing the step-by-step procedure involved in arriving at your results.
i. Accuracy = (TP + TN) / total [2 marks]
(3 + 2) / (3 + 1 + 5 + 2) = 5/11 ≈ 0.455
ii. Precision = TP / (TP + FP) [2 marks]
3 / (3 + 5) = 3/8 = 0.375
iii. Recall = TP / (TP + FN) [2 marks]
3 / (3 + 1) = 3/4 = 0.75
iv. F-measure = 2 * (precision * recall) / (precision + recall) [2 marks]
2 * (3/8 * 3/4) / (3/8 + 3/4) = (9/16) / (9/8) = 1/2 = 0.5
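The four measures in (c) can be verified with exact fractions. This is a minimal sketch; the function name `metrics` is chosen for illustration.

```python
from fractions import Fraction

def metrics(tp, fn, fp, tn):
    """Return accuracy, precision, recall and F-measure as exact fractions."""
    accuracy = Fraction(tp + tn, tp + fn + fp + tn)
    precision = Fraction(tp, tp + fp)
    recall = Fraction(tp, tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Counts from the confusion matrix in (b).
acc, prec, rec, f1 = metrics(tp=3, fn=1, fp=5, tn=2)
print(acc, prec, rec, f1)  # 5/11 3/8 3/4 1/2
```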
B. Consider the following set of one-dimensional points: {6, 12, 18, 24, 30, 42, 48}.
For each of the following sets of initial centroids
a) {18, 45} [3 marks]
b) {15, 40} [3 marks]
create two clusters by assigning each point to the nearest centroid, and then calculate
the sum squared error for each set of two clusters after updating the centroids.
TAKE HOME
Question 6:
Answer B:
With both sets of initial centroids, updating the centroids after the first assignment
leaves them unchanged. For {18, 45} the clusters are {6, 12, 18, 24, 30} and {42, 48},
whose means are 18 and 45, giving a sum squared error of 378. For {15, 40} the clusters
are {6, 12, 18, 24} and {30, 42, 48}, whose means are 15 and 40, giving a sum squared
error of 348. Because the centroids do not move, the K-means algorithm converges without
any further changes in the cluster assignments. Hence, both sets of centroids represent
stable solutions for this dataset, with {15, 40} giving the lower error.
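The assignment, centroid update and sum squared error for both sets of initial centroids can be reproduced with a short script. It is a sketch of a single K-means iteration for one-dimensional points; the function name `one_kmeans_step` is an illustrative choice, and the code assumes no cluster ends up empty (true for these inputs).

```python
def one_kmeans_step(points, centroids):
    """Assign each point to its nearest centroid, update the centroids,
    and return the clusters, the updated centroids and the SSE."""
    clusters = [[] for _ in centroids]
    for x in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    updated = [sum(c) / len(c) for c in clusters]  # assumes no empty cluster
    sse = sum((x - m) ** 2 for c, m in zip(clusters, updated) for x in c)
    return clusters, updated, sse

points = [6, 12, 18, 24, 30, 42, 48]

clusters_a, centroids_a, sse_a = one_kmeans_step(points, [18, 45])
clusters_b, centroids_b, sse_b = one_kmeans_step(points, [15, 40])

print(clusters_a, centroids_a, sse_a)
print(clusters_b, centroids_b, sse_b)
```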
Answer C:
For (NOT A) AND B, the output is 1 only when A is 0 and B is 1. Visually, the
output class 1 consists of the single point (A=0, B=1), which a straight line can
separate from the three points of output class 0. Therefore, the
function (NOT A) AND B is linearly separable.
For (A XOR B) AND (A OR B), the output is 1 only when A=0 and B=1 or A=1 and B=0,
so the function reduces to A XOR B. Visually, the points of output class 1 lie at
opposite corners of the input space: (A=0, B=1) and (A=1, B=0). They cannot be
separated from (A=0, B=0) and (A=1, B=1) by a single straight line (hyperplane) in
the input space. Therefore, the function (A XOR B) AND (A OR B) is not linearly
separable.
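Both conclusions can be checked by searching for a separating line w_A*A + w_B*B + bias > 0 over a small grid of weights. This is only a sketch: the function name `find_separator` and the grid of half-integer weights are assumptions, and failing to find a line on a grid is not by itself a proof, although for XOR it is a known result that no such line exists.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def find_separator(func, grid=None):
    """Return (w_a, w_b, bias) such that w_a*a + w_b*b + bias > 0 exactly
    when func(a, b) is true, or None if no grid point works."""
    if grid is None:
        grid = [i / 2 for i in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
    for w_a, w_b, bias in product(grid, repeat=3):
        if all((w_a * a + w_b * b + bias > 0) == bool(func(a, b))
               for a, b in INPUTS):
            return w_a, w_b, bias
    return None

def not_a_and_b(a, b):
    return (not a) and b

def xor_and_or(a, b):
    return (a ^ b) and (a or b)

print(find_separator(not_a_and_b))  # some separating line is found
print(find_separator(xor_and_or))   # None: no line on the grid separates the classes
```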
And 7 categories without (CID): patient follow-up (22), medical consultation (23), blood donation (24),
laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).
3. Month of absence
4. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
5. Seasons
6. Transportation expense
7. Distance from Residence to Work (kilometers)
8. Service time
9. Age
10. Work load Average/day
11. Hit target
12. Disciplinary failure (yes=1; no=0)
13. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
14. Son (number of children)
15. Social drinker (yes=1; no=0)
16. Social smoker (yes=1; no=0)
17. Pet (number of pets)
18. Weight
19. Height
20. Body mass index
21. Absenteeism time in hours (target)