Assignment 1 DA - E Oct 2023 V1-1
Assignment
Please submit your work via Canvas: (1) your code script with comments, (2) your report in Word,
PDF or similar format, and (3) a 4-minute mp4 recording. Name your final files starting as follows:
FirstName_Surname_. Make sure your code runs using only the initial dataset. Do not zip your
files.
Standard MTU penalties apply for work submitted after the due date.
Your submission should be your own work; plagiarism will be dealt with in accordance with MTU
regulations.
Note this assignment is worth 40% of this module. Reference your work appropriately.
Annotate your code with comments, especially code that is complicated; marks will be given for
comments that display understanding of all the code you use, including code given in labs and
class.
Marks are awarded for code that is succinct and neat and for labelling variables in a meaningful
and clear manner. Marks are also awarded for answers that show a level of individual thought and
expression, so add these where possible in your comments.
Question 1
Siobhán, a manager at a financial institution, has contacted you. She is asking for your assistance in assessing
the creditworthiness of potential future customers. She has a dataset of 904 past loan customer cases, with
14 attributes for each case, including financial standing, reason for the loan, employment,
demographic information, foreign national status, years of residence in the district, and the outcome/label variable
Credit Standing, which classifies each case as either a good loan or a bad loan.
Data Details
Most of the attributes are self-explanatory; the names of some of the attributes are somewhat cumbersome,
but this is what you have been given. Further details of some of them are given here:
Checking Acct – What level of regular checking account the customer has: No acct, 0balance, low
(balance), high (balance)
Credit History – One of the following levels:
  All paid – no credit taken or all credit paid back duly
  Bank Paid – All credit at this bank paid back
  Current – Existing loan/credit paid back duly till now
  Critical – Risky account or other credits at other banks
  Delay – Delay in paying back credit/loan in the past
Months Acct – The number of months the customer has held an account with the bank.
Credibility score – A score given to applicants to reflect the credibility of their repaying the loan, using a formula
created by a data analyst who had access to all historical data.
Check – A field created by the data analyst, who had access to all historical data, as a check on Credit Standing.
Using R or Python, help Siobhán answer the following questions. Make sure you explain your code, especially
the more complicated sections. If you are unable to complete some of the coding parts, explain in words,
with pseudocode if appropriate, what you would like to do.
a) Exploratory Data Analysis (EDA): Carry out EDA on the data set. Do you notice anything unusual
(missing data, outliers, duplicates etc.) or any patterns in the data set? Detail these and outline
any actions you propose to take before you start model building in part b). Max word count 500.
10 marks
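As an illustration of the kind of checks part a) has in mind, here is a minimal sketch in plain Python. The toy records, the column names `Age` and `Employment`, and the 1.5 × IQR outlier rule are all assumptions made for the example; they are not taken from the real dataset and are not the only reasonable choices.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical toy records; column names and values are illustrative only.
rows = [
    {"ID": 1, "Age": 34, "Employment": "Short"},
    {"ID": 2, "Age": 29, "Employment": None},        # missing value
    {"ID": 3, "Age": 41, "Employment": "Long"},
    {"ID": 3, "Age": 41, "Employment": "Long"},      # exact duplicate row
    {"ID": 4, "Age": 151, "Employment": "Medium"},   # implausible age
    {"ID": 5, "Age": 38, "Employment": "Short"},
]

def missing_counts(rows):
    """Count missing (None or empty string) entries per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                counts[column] += 1
    return dict(counts)

def duplicate_count(rows):
    """Count rows that repeat an earlier row exactly."""
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in seen.values())

def iqr_outliers(values):
    """Return values outside the 1.5 * IQR fences (one common rule of thumb)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

ages = [row["Age"] for row in rows]
print(missing_counts(rows))
print(duplicate_count(rows))
print(iqr_outliers(ages))
```

Whether a flagged value is corrected, removed or kept is a judgement call that should be justified in the report.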
b) Split the dataset into 75% training and 25% test set using set.seed(abc) where abc are the last 3 digits
of your student no. (Use this set.seed for all other functions with an element of randomness in this
work).
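For those working in Python, a seeded 75/25 split could be sketched as follows. The function name `train_test_split_ids`, the seed value 123 and the use of row IDs 1 to 904 are hypothetical placeholders; replace the seed with the last 3 digits of your own student number, and note that the number of rows may differ after any cleaning done in part a).

```python
import random

def train_test_split_ids(ids, train_fraction=0.75, seed=123):
    """Shuffle the IDs reproducibly and cut them into training and test lists."""
    rng = random.Random(seed)   # private generator, so the seed is applied every call
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical IDs 1..904, matching the number of cases in the brief.
train_ids, test_ids = train_test_split_ids(range(1, 905))
print(len(train_ids), len(test_ids))
```

Using a dedicated `random.Random(seed)` object rather than the global generator keeps the split reproducible even if other random calls are made elsewhere in the script.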
c) Using the code given in the labs or otherwise, use base R (or python equivalent) to build code using
the entropy formula to split only the categorical type predictor variables. Show which predictor
variable should be used for the root node split. Use only the training set from b) to do this and you
are not constrained to binary splits.
10 marks
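The entropy calculation in part c) can be sketched as below, assuming the standard definitions: entropy H = -Σ p·log2(p) over the class proportions, and information gain equal to the parent entropy minus the size-weighted entropy of the branches. The toy data and the predictor name `Housing` are hypothetical; this is not the lab code.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Gain from a multiway split: one branch per distinct level of the predictor."""
    total = len(labels)
    branches = {}
    for value, label in zip(values, labels):
        branches.setdefault(value, []).append(label)
    weighted = sum(len(b) / total * entropy(b) for b in branches.values())
    return entropy(labels) - weighted

# Hypothetical toy training data, for illustration only.
labels = ["Good", "Good", "Bad", "Bad", "Good", "Bad"]
predictors = {
    "Employment": ["Long", "Long", "Short", "Short", "Long", "Short"],
    "Housing": ["Own", "Rent", "Own", "Rent", "Own", "Rent"],
}

gains = {name: information_gain(col, labels) for name, col in predictors.items()}
root = max(gains, key=gains.get)   # predictor with the largest gain
print(gains, root)
```

The predictor with the largest gain is the candidate for the root node split.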
d) Now redo part c), but this time you are constrained to binary splits only, i.e. splits with only 2 possible
outcomes. Show how this affects your results and give reasons why this is the case.
10 marks
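One possible way to approach part d) is to try every way of grouping a predictor's levels into two sides and keep the grouping with the largest gain. A predictor with k levels has 2^(k-1) - 1 such groupings. The sketch below assumes this exhaustive approach; the function name `best_binary_split` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_binary_split(values, labels):
    """Try every two-sided grouping of the levels; return (left_levels, gain)."""
    levels = sorted(set(values))
    total = len(labels)
    parent = entropy(labels)
    best_left, best_gain = None, -1.0
    # Pin the first level to the left side so mirror-image groupings are not repeated.
    first, rest = levels[0], levels[1:]
    for size in range(len(rest)):            # the left side never takes every level
        for extra in combinations(rest, size):
            left_levels = {first, *extra}
            left = [y for v, y in zip(values, labels) if v in left_levels]
            right = [y for v, y in zip(values, labels) if v not in left_levels]
            children = len(left) / total * entropy(left) + len(right) / total * entropy(right)
            gain = parent - children
            if gain > best_gain:
                best_left, best_gain = left_levels, gain
    if best_left is None:                    # only one level, so no split is possible
        return None, 0.0
    return best_left, best_gain

# Hypothetical toy data using Credit History levels named in the brief.
history = ["All paid", "All paid", "Current", "Current",
           "Critical", "Critical", "Delay", "Delay"]
standing = ["Good", "Good", "Good", "Good", "Bad", "Bad", "Bad", "Bad"]

left_side, gain = best_binary_split(history, standing)
print(left_side, gain)
```

The exhaustive search grows quickly with the number of levels, which is worth bearing in mind for predictors with many categories.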
e) Now include the continuous numeric predictor variables, again using only binary splits. Which is now
the root node split? Analyse your results and comment.
10 marks
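For a numeric predictor, one common approach (assumed here, and not the only one) is to test the midpoints between consecutive distinct sorted values as thresholds and keep the one with the largest gain. The function name `best_threshold` and the toy values are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the best split of the form value <= threshold."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    total = len(labels)
    parent = entropy(labels)
    best_t, best_gain = None, 0.0
    for t in candidates:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        children = len(left) / total * entropy(left) + len(right) / total * entropy(right)
        gain = parent - children
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical toy values for a numeric predictor such as Months Acct.
months = [6, 12, 18, 24, 36, 48]
standing = ["Bad", "Bad", "Bad", "Good", "Good", "Good"]

threshold, gain = best_threshold(months, standing)
print(threshold, gain)
```

The best gain for each numeric predictor can then be compared directly with the binary-split gains of the categorical predictors to choose the root node.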
f) Now investigate the second split, i.e. determine which predictor variable(s) should be used to
split at the next level of the decision tree. Again, only binary splits are allowed here. Detail, in words,
diagrams and code, the approach you are going to use.
10 marks
g) Using the tree function from the package tree, or equivalent, build a decision tree, compare the
results to those in f) and comment. If you use pruning here, you should explain all the methodology
you use.
10 marks
h) Now see if you can improve your results by using a random forest model. Give your results (5 marks)
and explain and comment (5 marks).
10 marks
i) Due to GDPR, you are no longer allowed to use the following variables to build your model: Age,
Personal.Status and Foreign.National. Now redo your working for your best model. Give your results
and comment.
10 marks
j) Siobhán’s company uses a process that is a mixture of a grading system and human input to grade
each past loan as good or bad. Siobhán suspects that this process performed poorly during a
particular period. The ID numbers can be taken as time stamp values. Develop a strategy to find a
series of consecutive ID numbers where these gradings show a higher than normal pattern of
suspiciously incorrect or correct gradings. Detail how you go about your investigation.
10 marks
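One possible strategy for part j), sketched under stated assumptions: order the cases by ID, mark each case 1 if its recorded grade disagrees with some reference grade you trust and 0 otherwise, then slide a fixed-width window along the sequence to find where disagreements cluster. How the reference grade is obtained, the window width of 10 and the toy flags are all assumptions for illustration.

```python
def worst_window(flags, window):
    """Return (start_index, count) of the window holding the most flagged cases."""
    if window <= 0 or window > len(flags):
        raise ValueError("window must be between 1 and len(flags)")
    current = sum(flags[:window])
    best_start, best_count = 0, current
    for start in range(1, len(flags) - window + 1):
        # Slide one step: add the case entering the window, drop the one leaving.
        current += flags[start + window - 1] - flags[start - 1]
        if current > best_count:
            best_start, best_count = start, current
    return best_start, best_count

# Hypothetical flags for IDs 1..50: 1 = recorded grade disagrees with the reference.
ids = list(range(1, 51))
flags = [0] * 20 + [1, 1, 0, 1, 1, 1, 0, 1, 1, 1] + [0] * 20
flags[5] = 1   # an isolated disagreement outside the suspicious run

start, count = worst_window(flags, window=10)
suspect_ids = ids[start:start + 10]
overall_rate = sum(flags) / len(flags)
print(suspect_ids[0], suspect_ids[-1], count, overall_rate)
```

Comparing the window's disagreement rate with the overall rate, and repeating the search for several window widths, gives a basis for judging whether a run is higher than normal.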
k) Select 2 parts of your answer above, e.g. (i) and (j), and record a 4-minute video to demonstrate your
learning/understanding, ideally of the difficult parts of these questions. Only the first 4 minutes of the
recording will be viewed.
10 marks