Data Mining Review Questions / XLMiner Labs
Chapter 7 k-Nearest Neighbors (k-NN)
1. Personal Loan Acceptance. Universal Bank is a relatively young bank growing
rapidly in terms of overall customer acquisition. Universal Bank wants to convert
its liability customers (depositors) into personal loan customers (while retaining
them as depositors). A campaign that the bank ran last year for liability
customers showed a healthy conversion rate of over 9%. This has
encouraged the retail marketing department to devise smarter campaigns with
better target marketing. The goal of our analysis is to model the previous
campaign's customer behavior to analyze what combination of factors makes a
customer more likely to take out a personal loan.
The file UniversalBank.xls contains data on 5,000 customers. The data include
demographic information (age, income, etc.), the customer's relationship with
the bank (mortgage, securities account, etc.), and the customer's response to
the last personal loan campaign (variable = Personal Loan). Among the 5,000
customers, only 480 (9.6%) accepted the personal loan offer in the last
campaign (textbook reference - 7.1).
Partition the data into training (60%) and validation (40%) sets.
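Outside XLMiner, the partitioning step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name partition, the seed value, and the stand-in record IDs are hypothetical, and XLMiner's own random partitioning will not select the same rows.

```python
import random

def partition(rows, fractions, seed=1):
    """Shuffle rows and split them into consecutive parts sized by `fractions`."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    parts, start = [], 0
    for i, frac in enumerate(fractions):
        # The last part takes the remainder so that rounding never drops a row.
        if i == len(fractions) - 1:
            end = len(shuffled)
        else:
            end = start + round(frac * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    return parts

# Stand-in record IDs 1..5000; the real rows would come from UniversalBank.xls.
train, valid = partition(range(1, 5001), (0.6, 0.4))
print(len(train), len(valid))
```

The same helper accepts three fractions, for example partition(rows, (0.5, 0.3, 0.2)), which matches the training, validation, and test split requested in part e.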
a. Perform a k-NN classification with all input variables except ID and ZIP
CODE using k = 1. (Remember to transform categorical variables with more
than two categories into dummy variables.) Specify the success class as
1 (loan accepted), and use the default cutoff value of 0.5. How would
the following new customer be classified using your model: Age=40,
Experience=10, Income=84, Family=2, CCAvg=2, Education_1=0,
Education_2=1, Education_3=0, Mortgage=0, Securities Account=0, CD
Account=0, Online=1, and Credit Card=1?
b. What is the choice of k that balances between overfitting and ignoring the
predictor information? (Hint: Run k-NN for k values 1 to 10).
c. Using the Confusion Matrix for the validation data in Part b, how many
customers were classified correctly? How many customers were classified
incorrectly?
d. Classify the new customer using the best k.
e. Repartition the data; this time into training, validation, and test sets
(50% : 30% : 20%). Apply the k-NN method with the k chosen above.
Compare the Confusion Matrix of the test set with that of the training and
validation sets. Comment on the differences and their reason. What is
your assessment of the performance of this model?
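For readers who want to see the mechanics behind parts a through d, the following is a minimal sketch of k-NN classification with a 0.5 cutoff, a sweep over k, and a validation confusion matrix. It is not XLMiner's implementation. The function names are hypothetical, the seven training and four validation records are invented toy values rather than rows from UniversalBank.xls, and two choices are assumptions: predictors are standardized with training-set means and standard deviations, and a probability exactly equal to the cutoff is assigned to class 1.

```python
import math
from collections import Counter

def fit_scaler(rows):
    """Column means and standard deviations, computed from the training rows only."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds

def scale(rows, means, stds):
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def knn_classify(train_x, train_y, query, k, cutoff=0.5):
    """Class 1 if the share of class-1 records among the k nearest reaches the cutoff."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda rec: math.dist(rec[0], query))[:k]
    return 1 if sum(label for _, label in nearest) / k >= cutoff else 0

# Hypothetical toy records [Income, CCAvg]; NOT taken from UniversalBank.xls.
train_x = [[30, 0.5], [45, 1.0], [60, 1.5], [58, 1.45],
           [95, 3.0], [120, 4.0], [150, 5.0]]
train_y = [0, 0, 0, 1, 1, 1, 1]  # the fourth record is a deliberately noisy acceptor
valid_x = [[40, 0.8], [55, 1.4], [110, 3.5], [140, 4.8]]
valid_y = [0, 0, 1, 1]

means, stds = fit_scaler(train_x)
train_s, valid_s = scale(train_x, means, stds), scale(valid_x, means, stds)

# Validation error rate for each candidate k.
errors = {}
for k in (1, 3, 5, 7):
    preds = [knn_classify(train_s, train_y, q, k) for q in valid_s]
    errors[k] = sum(p != a for p, a in zip(preds, valid_y)) / len(valid_y)
    print(k, errors[k])

best_k = min(errors, key=errors.get)  # ties go to the smallest k tried
best_preds = [knn_classify(train_s, train_y, q, best_k) for q in valid_s]
matrix = Counter(zip(valid_y, best_preds))  # keys are (actual, predicted)
print(best_k, dict(matrix))
```

On this toy data, k = 1 misclassifies the validation record that sits next to the noisy training record, while k = 7 uses every training record and so predicts the majority class for everyone. Those are the two extremes that part b asks you to balance. The diagonal entries of the matrix, (0, 0) and (1, 1), count correct classifications, and the off-diagonal entries count errors, as in part c.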