
Data Mining Questions Q&A

This document provides instructions for a 3-question exam assessing skills in data mining tools WEKA, RapidMiner, and SPSS. Question 1 requires using WEKA and SPSS to perform tasks on an absenteeism dataset including feature selection, data visualization, classification/clustering, descriptive statistics, discretization, and comparison of variables. Question 2 involves using RapidMiner to construct a decision tree from a sample purchase computer dataset and addressing issues like missing/outlier handling. Question 3 describes a medical drug sample testing scenario and spreadsheet dataset for analytical deductions.

Uploaded by

aaakandoh

SESSION: REGULAR DURATION: 2.5 HOURS

INSTRUCTIONS:
There are five (5) Questions in this Exam with each Question worth a total of 25
Marks. Read the Questions carefully and attempt Question 1 OR Question 2, and any
other Two (2) Questions. In all, you are answering THREE (3) Questions out of the
five provided. You are expected to type your answers to Question 1 OR Question 2
(depending on the one you choose) in MS Word, and save in pdf with your
StudentID as the file name. Note that, answers to your two selected Questions from
Questions 3 to 5 must be written in the Answer Booklet provided.
Submit your saved pdf together with all relevant files by uploading them to the
Moodle LMS via the Exam thread created.
Remember Question 1 OR 2 is mandatory to answer.
The following toolkits are required for answering Questions 1 or 2.
-WEKA
-RapidMiner
-SPSS

Question 1. [25 Marks]


This Question requires the use of WEKA and SPSS toolkits. Consider the Absenteeism
dataset saved as a .txt file with the name absenteeism.txt with the attribute information
provided in Appendix A at the end of Question 5. You are expected to download the dataset
from the link: https://tinyurl.com/wvac2a6z
The dataset was created with records of absenteeism at work from July 2007 to July 2010 at a
courier company in a given country.
You are expected to perform the following tasks:
a) Create both .arff and .sav files from the given absenteeism.txt file. Save your file with
the names absenteeism.arff and absenteeism.sav
You need to make sure you consider all the necessary details needed when saving
your file as .arff i.e., @relation, @attribute and @data before calling it in WEKA.
Again, provide the necessary variable names and details for the .sav file before calling
it in SPSS. You need to submit both .arff and .sav files. [4 marks]

b) You are to call your .arff file in WEKA and consider a relevant feature selection
method to select your features, bearing in mind the label or target feature as shown in
Appendix A. Consider using the attribute evaluator and search method functions. Report on
your selected features as well as the feature selection algorithm used. That is, provide
the total number of features selected and their respective names. [3 marks]

c) At the preprocess tab in WEKA, select the features reported in (b) using the invert
and remove buttons. Report a data visualization of the selected features together with
the target feature using the visualize all button. [1 mark]

Examiner: Dr Solomon Mensah Page 1 of 6


d) Based on the label feature, identify whether the given dataset can be used for a
classification or clustering problem. [2 marks]

e) With regard to your response in (d), use any suitable classification or clustering
technique to train and validate the dataset in WEKA. Report your result with respect to
significant information. [5 marks]

f) Call the dataset in SPSS and provide descriptive statistics for all features. A single
table will do for this part. Report on relevant statistics (mean, mode, median, min,
max, range, standard deviation) based on the features. [3 marks]

g) Out of the selected features, you are to discretize all the continuous features or
variables. You can consider using the recode into different variables function in SPSS.
You are to save and send the updated version of the .sav dataset bearing the recoded
variable names. [2 marks]

h) For each of your discretized variables, provide either a bar chart or pie chart. [2 marks]

i) Using the discretized variables, make any comparison between the target variable and
any of your discretized variables. You can consider using the cross tabulation
functionality in SPSS. Explain your result. [3 marks]
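
For task (a), the three ARFF sections (@relation, @attribute, @data) can be sketched as below. This is a minimal illustration only: the helper name `to_arff`, the two-attribute excerpt and the sample rows are hypothetical, and the real file would declare all 21 attributes from Appendix A using whatever delimiter absenteeism.txt actually uses.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text from (name, type) attribute pairs and data rows."""
    lines = [f"@relation {relation}", ""]
    for name, attr_type in attributes:
        # One @attribute line per column, in the same order as the data.
        lines.append(f"@attribute {name} {attr_type}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical excerpt: two of the attributes, with made-up rows.
attributes = [("Age", "numeric"), ("Absenteeism_time_in_hours", "numeric")]
rows = [(33, 4), (50, 0)]

arff_text = to_arff("absenteeism", attributes, rows)
print(arff_text)
```

Attribute names containing spaces would need quoting in ARFF, which is why the sketch uses underscores.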

Question 2. [25 Marks]


This Question requires the use of the RapidMiner toolkit. Consider the dataset in Table 1 to
be used to train a decision tree. The dataset comprises the following attributes, namely
age, income, student, credit_rating and buys_computer. The buys_computer attribute is
considered as the dependent variable and the remaining attributes are considered as the
independent variables. Imagine you are asked to set up a decision tree for training the
dataset; briefly explain how you will address the following issues:
a) How many features are in the dataset presented in the table below? [1 mark]
Answer: There are five (5) features in the dataset: age, income, student, credit_rating and buys_computer.
b) How many tuples are in the dataset? [1 mark]
Answer: There are fourteen (14) tuples in the dataset
c) Which feature of the dataset makes it suitable for considering a supervised
learning algorithm such as decision tree? [1 mark]
Answer: buys_computer, as it is the labelled dependent (target) variable.
d) Aside from the decision tree, list any two supervised learning algorithms that can
also be used for training the dataset. [1 mark]
Answer: random forest, logistic regression, neural network, support vector machine.
e) Out of the categorical variables, list two (2) dichotomous variables. [1 mark]
Answer: Any two of student, credit_rating and buys_computer (each takes exactly two values).
f) Compute the information gain for each of the independent attributes. [5 marks]
g) With reference to the information gains computed in (f), determine which attribute
can be considered as the root node for the decision tree. [1 mark]
Answer: age, which has the highest information gain (approximately 0.247).
h) Complete the construction of the decision tree showing how you arrived at the tree.
[5 marks]
i) Assume there were missing values in the dataset, discuss two ways of handling
them. [2 marks]
Answer:
• Fill in the missing values manually (tedious and often infeasible).
• Fill them in automatically with a global constant, the attribute mean, the most
probable value, etc.
j) Assume there were outliers in the dataset, discuss two ways of handling them.
[2 marks]
Answer: Remove the outliers, or smooth the data using a statistical model.
k) Call the Purchase Computer dataset (Table 1) in RapidMiner and construct the tree
using the various operators. Report on the step-by-step procedure you considered in
constructing the tree based on your selected operators in RapidMiner. [5 marks]

Steps in setting up predictive models


1. Extract data (primary/secondary)
2. Preprocess extracted data (trimming and log transformation)
3. Feature selection (recommend prior feature)
4. Sample selection (Bellwethers)
5. Training + validation needs
6. Learner (deep learning)
7. Performance evaluation (recommend single evaluator)
8. Statistical and practical significance (Yuen's test, Brunner's ANOVA-like
test, Cliff's delta effect size)



Table 1. Purchase Computer Dataset
age     income   student   credit_rating   buys_computer
≤30     high     no        fair            no
≤30     high     no        excellent       no
31-40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31-40   low      yes       excellent       yes
≤30     medium   no        fair            no
≤30     low      yes       fair            yes
>40     medium   yes       fair            yes
≤30     medium   yes       excellent       yes
31-40   medium   no        excellent       yes
31-40   high     yes       fair            yes
>40     medium   no        excellent       no
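
The information gains asked for in Question 2(f) and (g) can be checked with a short script. The sketch below assumes the standard entropy-based definition, Gain(A) = Info(D) - Info_A(D); the names `ROWS`, `entropy` and `info_gain` are illustrative, and "≤30" is written as "<=30" for convenience.

```python
from collections import Counter
from math import log2

# Table 1 rows: (age, income, student, credit_rating, buys_computer)
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31-40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31-40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31-40", "medium", "no", "excellent", "yes"),
    ("31-40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
NAMES = ["age", "income", "student", "credit_rating"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def info_gain(rows, col):
    """Entropy of the target minus the weighted entropy after splitting on col."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: info_gain(ROWS, i) for i, name in enumerate(NAMES)}
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {gain:.3f}")
```

The attribute with the largest gain is the candidate root node; the same function can be reapplied to each branch's subset of rows to continue building the tree in part (h).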

Question 3. [25 Marks]


a) Given a set of 100,000 medical drug products each emerging from two (2) different
pharmaceutical companies, namely Company A and Company B to be provided to
patients at a hospital in Tema. Imagine that out of the total products, only a sample
of 300 products were tested by the Food and Drugs Authority (FDA). The FDA
found at least two defective/fake products resulting in discarding the total set of
products. After prior testing done by the two companies on each of the 100,000
drugs, it was found that 0.5% of the products emerging from Company A were
defective and none was defective from the perspective of Company B. As a
research student abreast with Data Mining techniques, you are presented with the
dataset from these two companies on a spreadsheet and need to make analytical
deductions and predictions from it. Assume there are 5 input features located on
A2:E100001 and one target feature located on F2:F100001 on the spreadsheet
respectively.
i. Per the information given above about the medical products, which type of
classification algorithm will you use to perform your mining: supervised
or unsupervised classification? Provide a reason. [2 marks]
Answer: Supervised classification, because the data are labelled
(defective or not) in the target feature column, which allows
us to train a model to predict the target feature from the input
features.

ii. Mention any four algorithms you can use per your recommended type of
classification in (i) above. [2 marks]
Answer: random forest, logistic regression, neural network, support vector machine.

iii. Explain any three major tasks you will undertake during preprocessing of
the data. [3 marks]
Answer:
• Data cleaning: Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration: Integration of multiple databases, data cubes, or files
• Data transformation: Normalization and aggregation
• Data reduction: Obtains reduced representation in volume but produces the
same or similar analytical results
• Data discretization: Part of data reduction but with particular importance,
especially for numerical data

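iv. Explain how you will normalize your dataset with any suitable
normalization technique. [2 marks]
Answer: Normalization scales attribute values to fall within a small, specified range
(for example, 0 to 1). Suitable techniques are:
– min-max normalization: v' = (v - min) / (max - min) * (new_max - new_min) + new_min
– z-score normalization: v' = (v - mean) / standard deviation
– normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1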
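
The three normalization techniques named above can be sketched in a few lines. This is an illustration only: the `sample` values are hypothetical, and the z-score version assumes the population standard deviation (some texts use the sample standard deviation instead).

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Centre on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


sample = [200, 300, 400, 600, 1000]  # hypothetical attribute values
print(min_max(sample))
print(z_score(sample))
print(decimal_scaling(sample))
```

In practice each of the five input features would be normalized separately, using statistics computed on the training partition only.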

b. Explain how you will separate the dataset into the right percentages or
partitions before subjecting it to your chosen algorithm. [3 marks]
Answer: Split the data into 70% for training and 30% for testing, or 80% for training and 20% for
testing.
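
A 70/30 hold-out split can be sketched as follows. The function name `train_test_split`, the fixed seed and the stand-in row list are assumptions made for illustration; the real rows would be the 100,000 spreadsheet records.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows, then cut it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100_000))  # stand-in for the 100,000 product records
train, test = train_test_split(rows, train_fraction=0.7)
print(len(train), len(test))
```

Shuffling before cutting matters when the spreadsheet is ordered, for example by company. With a class as rare as the defective products here, a stratified split that preserves the class proportions in both parts may be preferable.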

c. Do you think prediction or forecasting can be made from your chosen
model implemented from the algorithm used? If yes, how can prediction be made
for new input values? [3 marks]
Answer: Yes, using regression.



d. Give, with valid evidence, the type of probability model used to subject
the 300 sampled products to test. [2 marks]
Answer: Binomial distribution

v. Consider the following set of frequent 3-itemsets:
{1,2,3}, {1,2,4}, {1,3,4}, {1,3,5}, {2,3,4}
i. List all candidate 4-itemsets obtained using the candidate generation step of the
Apriori algorithm. [4 marks]
ii. List all candidate 4-itemsets that survive the candidate pruning step of the
Apriori algorithm before support counting. [4 marks]
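
Both Apriori steps in part (v) can be traced with a short script. The sketch assumes the common F(k-1) x F(k-1) join, in which two frequent itemsets are merged when they share their first k-2 items; other generation strategies exist. The function names are illustrative.

```python
from itertools import combinations

frequent_3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]


def generate_candidates(frequent):
    """Join step: merge sorted itemsets that agree on all but their last item."""
    candidates = []
    for a, b in combinations(sorted(frequent), 2):
        if a[:-1] == b[:-1]:
            candidates.append(a + (b[-1],))
    return candidates


def prune(candidates, frequent):
    """Prune step: keep a candidate only if every (k-1)-subset is frequent."""
    frequent_set = set(frequent)
    return [
        c for c in candidates
        if all(subset in frequent_set for subset in combinations(c, len(c) - 1))
    ]


candidates = generate_candidates(frequent_3)
survivors = prune(candidates, frequent_3)
print("generated:", candidates)
print("after pruning:", survivors)
```

Under this join, {1,2,3,4} and {1,3,4,5} are generated; {1,3,4,5} is then pruned because its subsets {1,4,5} and {3,4,5} are not among the frequent 3-itemsets.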

Binomial Distribution:
The binomial distribution is commonly used when conducting tests involving
binary outcomes (e.g., defective or non-defective, fake or genuine).
In this scenario, the FDA is testing a sample of products to identify defects, which
is a binary outcome (defective or non-defective).
The binomial distribution describes the number of successes (defective products)
in a fixed number of independent Bernoulli trials (testing each product).
Hypothesis Testing for Proportions:

The FDA may have formulated hypotheses about the proportion of defective
products in the entire population.
They would then collect a sample of products and test whether the proportion of
defective products in the sample differs significantly from a specified value (e.g., the
proportion of defective products from Company A).
Evidence:

The scenario mentions that the FDA found "at least two defective/fake products"
in the sample of 300 products. This suggests that the FDA was interested in
determining the proportion of defective products in the sample.
The use of a binomial test or hypothesis testing for proportions aligns with the
objective of identifying defective products in a sample through statistical inference.
In summary, based on the scenario and the objective of testing the sampled
products for defects, it is likely that the FDA used a probability model such as a
binomial test or a hypothesis test for proportions. These models are commonly
employed when dealing with binary outcomes and testing hypotheses about
proportions in a population.
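
As a worked illustration of the binomial model, the sketch below computes the probability of finding at least two defective products in a sample of 300. It assumes, purely for illustration, that every sampled product is independently defective with probability 0.005 (the rate reported for Company A); the scenario does not state how the sample was split between the two companies.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n = 300     # sample size tested by the FDA
p = 0.005   # assumed per-product defect probability (illustrative)

# P(X >= 2) = 1 - P(X = 0) - P(X = 1)
p_at_least_two = 1 - binomial_pmf(0, n, p) - binomial_pmf(1, n, p)
print(f"P(at least 2 defective in {n}) = {p_at_least_two:.4f}")
```

Strictly, sampling 300 of 100,000 products without replacement follows a hypergeometric distribution; the binomial is a close approximation because the sample is a small fraction of the population.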

Question 4. [25 Marks]


Consider a dataset, namely weather with four input features – outlook,
temperature, humidity and windy. The target for the given dataset is play which is
a dichotomous variable with labels yes and no. The dataset has 14 instances with
the target variable having 9 instances in the yes class and 5 instances in the no
class.
In the attempt of setting up a classification model for the given dataset, two main
classification algorithms, namely Naïve Bayes and Logistic Regression were set
up in WEKA and their outputs are given below in Fig. 1 and Fig. 2 respectively.

a) Comparing the two outputs in Fig. 1 and Fig. 2, which model will you
recommend as optimal for classification of the given dataset? [4 marks]
Ans: Naïve Bayesian model

b) Justify your answer for the best model in (a) above with valid reasons based on the
outputs presented. [7 marks]
Ans: The performance (evaluation) metrics are higher for the Naïve Bayesian model
than for Logistic Regression. For instance, the weighted average recall of the
Naïve Bayesian model is 0.643, which is greater than that of Logistic Regression
(0.571).

c) Explain the Confusion Matrix for your model selected in (a). [10 marks]
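Ans: For the Naïve Bayesian model, TP = 8, FN = 1, FP = 4 and TN = 1:
              Predicted yes   Predicted no
Actual yes    TP = 8          FN = 1
Actual no     FP = 4          TN = 1
That is, 8 of the 9 yes instances and 1 of the 5 no instances were classified correctly.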
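
The weighted average recall of 0.643 quoted in (b) can be reproduced from this confusion matrix. The sketch below assumes the usual definition, in which each class's recall is weighted by the number of actual instances of that class.

```python
# Confusion matrix for the Naïve Bayesian model, taken from the answer to (c).
tp, fn, fp, tn = 8, 1, 4, 1

recall_yes = tp / (tp + fn)   # share of actual yes instances found
recall_no = tn / (tn + fp)    # share of actual no instances found

n_yes, n_no = tp + fn, fp + tn
weighted_recall = (n_yes * recall_yes + n_no * recall_no) / (n_yes + n_no)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"recall(yes) = {recall_yes:.3f}, recall(no) = {recall_no:.3f}")
print(f"weighted recall = {weighted_recall:.3f}, accuracy = {accuracy:.3f}")
```

The class-weighted recall always equals the overall accuracy, which is why both come out at 9/14 here.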

d) Imagine the yes class has 4 instances instead of 9 and the no class has 10 instances
instead of 5, which technique can be considered to increase the success (yes)
instances while maintaining the failure (no) instances. [4 marks]
Ans: Synthetic Minority Oversampling Technique (SMOTE) / oversampling of the minority (yes) class
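
The idea of oversampling can be sketched with plain random oversampling, which duplicates minority-class rows until the classes are balanced. Note that this is not SMOTE itself: SMOTE synthesizes new minority instances by interpolating between neighbours rather than copying existing rows. The function name and the placeholder feature value are hypothetical.

```python
import random


def random_oversample(rows, label_index=-1, seed=0):
    """Duplicate randomly chosen minority rows until every class is equally large."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_index], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)  # keep all original rows
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# 4 yes and 10 no instances, as in part (d); "x" is a placeholder feature value.
rows = [("x", "yes")] * 4 + [("x", "no")] * 10
balanced = random_oversample(rows)
counts = {label: sum(1 for r in balanced if r[-1] == label) for label in ("yes", "no")}
print(counts)
```

The no instances are left untouched, which matches the requirement to increase the yes instances while maintaining the no instances.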



[Fig. 1: WEKA output for Naïve Bayes]   [Fig. 2: WEKA output for Logistic Regression]

Question 5. [25 Marks]


A. Assume that the support vector machine (SVM) classifier is applied on a given dataset
and the output from the classifier benchmarked against the actual labels of the dataset
is depicted in the following table:
Actual Label   Y  Y  Y  N  N  N  N  Y  N  N  N
SVM Output     Y  Y  N  Y  N  Y  N  Y  Y  Y  Y

a) Provide a general overview of the confusion matrix in a tabular form showing the true
positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).

ANS: CONFUSION MATRIX

              Predicted Y   Predicted N
Actual Y      TP            FN
Actual N      FP            TN

[4 marks]

b) Create a confusion matrix in a tabular form for the output from the classifier and the
actual dataset labels. [7 marks]
ANS:
              Predicted Y   Predicted N
Actual Y      TP = 3        FN = 1
Actual N      FP = 5        TN = 2

c) From the confusion matrix in (b), compute the following performance measures by
showing the step-by-step procedure involved in arriving at your results.
i. Accuracy = (TP + TN) / (TP + TN + FP + FN) [2 marks]
= (3 + 2) / (3 + 2 + 5 + 1) = 5/11 ≈ 0.455
ii. Precision = TP / (TP + FP) [2 marks]
= 3 / (3 + 5) = 3/8 = 0.375
iii. Recall = TP / (TP + FN) [2 marks]
= 3 / (3 + 1) = 3/4 = 0.75
iv. F-measure = 2 * (precision * recall) / (precision + recall) [2 marks]
= 2 * (3/8 * 3/4) / (3/8 + 3/4) = (9/16) / (9/8) = 1/2 = 0.5
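
The confusion matrix and the four measures can be recomputed directly from the two label sequences in the table for part A. The sketch treats Y as the positive class and uses exact fractions so the results can be compared with the hand calculation.

```python
from fractions import Fraction

actual = "Y Y Y N N N N Y N N N".split()
predicted = "Y Y N Y N Y N Y Y Y Y".split()

pairs = list(zip(actual, predicted))
tp = sum(a == "Y" and p == "Y" for a, p in pairs)
fn = sum(a == "Y" and p == "N" for a, p in pairs)
fp = sum(a == "N" and p == "Y" for a, p in pairs)
tn = sum(a == "N" and p == "N" for a, p in pairs)

accuracy = Fraction(tp + tn, len(pairs))
precision = Fraction(tp, tp + fp)
recall = Fraction(tp, tp + fn)
f_measure = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"accuracy={accuracy} precision={precision} recall={recall} F={f_measure}")
```

The F-measure is the harmonic mean of precision and recall, so it always lies between the two and is pulled towards the smaller value.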

B. Consider the following set of one-dimensional points: {6, 12, 18, 24, 30, 42, 48}.
For each of the following sets of initial centroids
a) {18, 45} [3 marks]
b) {15, 40} [3 marks]
create two clusters by assigning each point to the nearest centroid, and then calculate
the sum squared error for each set of two clusters after updating the centroids.

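Answer a) Initial centroids {18, 45}:
Cluster 1 {6, 12, 18, 24, 30}, Cluster 2 {42, 48}
New Centroid 1: (6+12+18+24+30)/5 = 90/5 = 18
New Centroid 2: (42+48)/2 = 45
SSE = (144 + 36 + 0 + 36 + 144) + (9 + 9) = 360 + 18 = 378

Answer b) Initial centroids {15, 40}:
Cluster 1 {6, 12, 18, 24}, Cluster 2 {30, 42, 48}
New Centroid 1: (6+12+18+24)/4 = 60/4 = 15
New Centroid 2: (30+42+48)/3 = 120/3 = 40
SSE = (81 + 9 + 9 + 81) + (100 + 4 + 64) = 180 + 168 = 348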
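
The assignment, centroid update and sum of squared errors (SSE) for both sets of initial centroids can be checked with one K-means step. The function name `one_kmeans_step` is illustrative, and the sketch assumes no cluster ends up empty, which holds for these inputs.

```python
points = [6, 12, 18, 24, 30, 42, 48]


def one_kmeans_step(points, centroids):
    """Assign each point to its nearest centroid, update centroids, return SSE."""
    clusters = [[] for _ in centroids]
    for x in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    new_centroids = [sum(c) / len(c) for c in clusters]
    sse = sum((x - m) ** 2 for c, m in zip(clusters, new_centroids) for x in c)
    return clusters, new_centroids, sse


result_a = one_kmeans_step(points, [18, 45])
result_b = one_kmeans_step(points, [15, 40])
for label, (clusters, centroids, sse) in (("a", result_a), ("b", result_b)):
    print(label, clusters, centroids, sse)
```

In both cases the updated centroids equal the initial ones, so a further iteration would not change the assignments.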

TAKE HOME

Question 6:

Answer B:
Both sets of initial centroids already seem to be located close to the centers of their
respective clusters. Additionally, the clusters appear to be well-separated. Therefore, it is
likely that the K-means algorithm, when applied with these initial centroids, would
converge without any further changes in the cluster assignments. Hence, both sets of
centroids represent stable solutions for this specific dataset.

Answer C:
The output of the function is 1 only when A is 0 and B is 1. Visually, the
single point corresponding to output class 1, (A=0, B=1), can be separated by a
straight line from the three points corresponding to output class 0. Therefore, the
function (NOT A) AND B is linearly separable.

The output of the function is 1 only when A=0 and B=1 or A=1 and B=0. Visually, the
points corresponding to the output class 1 lie at two opposite corners of the unit
square: (A=0, B=1) and (A=1, B=0). They cannot be separated from the class 0 points
by a single straight line (hyperplane) in the input space. Therefore, the function
(A XOR B) AND (A OR B) is not linearly separable.
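
Both claims can be illustrated by searching for a separating line w1*A + w2*B + b > 0 over a small grid of integer weights. This is only a sketch: finding weights demonstrates separability, whereas an empty search over a finite grid is consistent with, but does not by itself prove, non-separability (for XOR the impossibility is a standard result). The function names and the weight range are assumptions.

```python
from itertools import product

INPUTS = list(product([0, 1], repeat=2))  # the four (A, B) combinations


def not_a_and_b(a, b):
    return int((not a) and b)


def xor_and_or(a, b):
    return (a ^ b) & (a | b)  # equivalent to A XOR B


def find_separator(func, weight_range=range(-3, 4)):
    """Return (w1, w2, bias) classifying all four inputs correctly, else None."""
    for w1, w2, bias in product(weight_range, repeat=3):
        if all((w1 * x1 + w2 * x2 + bias > 0) == bool(func(x1, x2))
               for x1, x2 in INPUTS):
            return (w1, w2, bias)
    return None


sep_first = find_separator(not_a_and_b)
sep_second = find_separator(xor_and_or)
print("(NOT A) AND B:", sep_first)
print("(A XOR B) AND (A OR B):", sep_second)
```

A single perceptron can therefore represent the first function but not the second, which would need at least one hidden layer.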

Appendix A: Attribute Information of Absenteeism Dataset
1. Individual identification (ID)
2. Reason for absence (ICD).
Absences attested by the International Code of Diseases (ICD) stratified into 21 categories (I to XXI)
as follows:
I Certain infectious and parasitic diseases
II Neoplasms
III Diseases of the blood and blood-forming organs and certain disorders involving the
immune mechanism
IV Endocrine, nutritional and metabolic diseases
V Mental and behavioural disorders
VI Diseases of the nervous system
VII Diseases of the eye and adnexa
VIII Diseases of the ear and mastoid process
IX Diseases of the circulatory system
X Diseases of the respiratory system
XI Diseases of the digestive system
XII Diseases of the skin and subcutaneous tissue
XIII Diseases of the musculoskeletal system and connective tissue
XIV Diseases of the genitourinary system
XV Pregnancy, childbirth and the puerperium
XVI Certain conditions originating in the perinatal period
XVII Congenital malformations, deformations and chromosomal abnormalities
XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified
XIX Injury, poisoning and certain other consequences of external causes
XX External causes of morbidity and mortality
XXI Factors influencing health status and contact with health services.

And 7 categories without ICD: patient follow-up (22), medical consultation (23), blood donation (24),
laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).
3. Month of absence
4. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
5. Seasons
6. Transportation expense
7. Distance from Residence to Work (kilometers)
8. Service time
9. Age
10. Work load Average/day
11. Hit target
12. Disciplinary failure (yes=1; no=0)
13. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
14. Son (number of children)
15. Social drinker (yes=1; no=0)
16. Social smoker (yes=1; no=0)
17. Pet (number of pets)
18. Weight
19. Height
20. Body mass index
21. Absenteeism time in hours (target)
