Task 5
Load the dataset creditcard_fraud_subsample.csv from STiNE. The dataset contains transactions made by credit cards of European cardholders over two days in September 2013. It contains only numerical input variables, which are the result of a transformation (due to confidentiality issues, no further background on the original features is available). The only features that have not been transformed are Time and Amount: Time gives the seconds elapsed between each transaction and the first transaction in the dataset, and Amount gives the transaction amount. The feature Class is the response variable; it takes the value 1 in case of fraud and 0 otherwise.
1. The first task is to preprocess the dataset. Remove the feature ‘Time’ from the dataset and standardize the feature ‘Amount’ by subtracting the mean and scaling to unit variance.
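A minimal preprocessing sketch, assuming the CSV has been downloaded from STiNE into the working directory (using scikit-learn's StandardScaler is a choice, not prescribed by the task):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("creditcard_fraud_subsample.csv")

# Remove the feature 'Time'.
df = df.drop(columns=["Time"])

# Standardize 'Amount': subtract the mean, scale to unit variance.
df[["Amount"]] = StandardScaler().fit_transform(df[["Amount"]])

X = df.drop(columns=["Class"])
y = df["Class"]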
2. Split the data into a training and a test set to be able to assess the out-of-sample performance (use 40 percent of the data for the test set).
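For example, with scikit-learn's train_test_split (the fixed seed and the stratification on the class label are choices made here for reproducibility and to keep the rare fraud cases represented in both sets):

from sklearn.model_selection import train_test_split

# 40 percent of the data goes to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)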
3. Do some research on the classification method called “Gradient Boosting”, for example
using the book “Elements of Statistical Learning”. Explain the main idea and describe
how the algorithm proceeds.
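As an illustration of the main idea (not a substitute for your written explanation), here is a toy from-scratch sketch for squared-error regression: each round fits a small tree to the negative gradient of the loss, which for squared error is simply the residual, and adds its scaled prediction to the current model.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=100, learning_rate=0.1, max_depth=2):
    """Toy gradient boosting with squared-error loss (illustration only)."""
    prediction = np.full(len(y), np.mean(y))   # start from a constant model
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction             # negative gradient of (y - f)^2 / 2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return trees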
4. We will use a version of gradient boosting that is called extreme gradient boosting. A popular implementation is XGBoost, see https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier. Hint: You have to install the XGBoost package using pip install xgboost, more info available at https://xgboost.readthedocs.io/en/stable/install.html#python. If you have problems with installing XGBoost, you can also use scikit-learn's GradientBoostingClassifier, see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html.
5. Use the XGBClassifier to predict the value of Class based on the available features. Compute the accuracy. Compare the predictive performance to that of a logistic regression. Also compare the accuracy with a naive classifier that simply ‘predicts’ “no fraud” in every possible case. Why is the accuracy of this naive classifier still very close to one?
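A sketch of all three classifiers on the split from step 2 (XGBClassifier follows the scikit-learn fit/predict interface; max_iter=1000 for the logistic regression is an assumption made here to avoid convergence warnings):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

xgb = XGBClassifier().fit(X_train, y_train)
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred_xgb = xgb.predict(X_test)
y_pred_logreg = logreg.predict(X_test)
y_pred_naive = np.zeros(len(y_test), dtype=int)  # always 'no fraud'

for name, y_pred in [("XGBoost", y_pred_xgb),
                     ("Logistic regression", y_pred_logreg),
                     ("Naive", y_pred_naive)]:
    print(name, accuracy_score(y_test, y_pred))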
6. To be able to evaluate the performance, we define the confusion matrix (binary classification, class 1 $\hat{=}$ positive)

$$C = \begin{pmatrix} C_{0,0} & C_{0,1} \\ C_{1,0} & C_{1,1} \end{pmatrix},$$

where $C_{i,j}$ contains the number of observations of the test sample that are in class $i$ and classified as class $j$. Write a function which takes two vectors (where each entry is either zero or one) as input and calculates the confusion matrix. Use your function to calculate the confusion matrix on the test set for both classifiers.
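A minimal sketch of such a function (the name confusion_matrix_ is chosen here to avoid shadowing scikit-learn's confusion_matrix):

import numpy as np

def confusion_matrix_(y_true, y_pred):
    """C[i, j] counts observations that are in class i and classified as class j."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    C = np.zeros((2, 2), dtype=int)
    for i in (0, 1):
        for j in (0, 1):
            C[i, j] = np.sum((y_true == i) & (y_pred == j))
    return C

For example, confusion_matrix_(y_test, y_pred_xgb) and confusion_matrix_(y_test, y_pred_logreg).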
7. We define

$$\text{precision} := \frac{C_{1,1}}{C_{1,1} + C_{0,1}} \qquad \text{and} \qquad \text{recall} := \frac{C_{1,1}}{C_{1,1} + C_{1,0}}.$$
What do precision and recall measure? Write functions which take two vectors as input
and calculate precision and recall. Use your functions to evaluate your classifier (here
nan is considered a valid result if you divide by zero).
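Possible implementations on top of the confusion-matrix function from step 6, returning nan on division by zero as the task allows:

def precision(y_true, y_pred):
    C = confusion_matrix_(y_true, y_pred)
    denom = C[1, 1] + C[0, 1]   # everything classified as positive
    return C[1, 1] / denom if denom > 0 else float("nan")

def recall(y_true, y_pred):
    C = confusion_matrix_(y_true, y_pred)
    denom = C[1, 1] + C[1, 0]   # everything that is actually positive
    return C[1, 1] / denom if denom > 0 else float("nan")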
8. Finally, we combine precision and recall into the $F_1$-score

$$F_1 := 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$
Write a function which takes two vectors as input and calculates the $F_1$-score. Try to train new classifiers (e.g. Random Forest, SVM, Logistic Regression) with different tuning parameters and maximize the $F_1$-score on the test sample.
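The $F_1$-score in terms of the two functions above (nan again propagates when precision or recall is undefined):

def f1_score_(y_true, y_pred):
    p, r = precision(y_true, y_pred), recall(y_true, y_pred)
    return 2 * p * r / (p + r) if (p + r) > 0 else float("nan")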
9. (Bonus): Explain the role of the parameters n_estimators and learning_rate in XGBoost. Set the learning_rate to a value of 0.1. We want to find the optimal value of n_estimators, i.e., the one that gives the highest $F_1$-score. Provide your own implementation of cross-validation from scratch in order to find the best possible value for n_estimators based on the cross-validated $F_1$-score.
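A from-scratch $k$-fold cross-validation sketch; the number of folds, the seed and the candidate grid are assumptions, and f1_score_ is the function from step 8:

import numpy as np
from xgboost import XGBClassifier

def cv_f1(X, y, n_estimators, n_folds=5, seed=0):
    """Mean cross-validated F1-score of XGBoost with learning_rate=0.1."""
    X, y = np.asarray(X), np.asarray(y)
    # Shuffle the indices once, then split them into k disjoint folds.
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), n_folds)
    scores = []
    for k in range(n_folds):
        train_idx = np.concatenate([folds[m] for m in range(n_folds) if m != k])
        model = XGBClassifier(n_estimators=n_estimators, learning_rate=0.1)
        model.fit(X[train_idx], y[train_idx])
        scores.append(f1_score_(y[folds[k]], model.predict(X[folds[k]])))
    return np.mean(scores)

# Assumed candidate grid; pick the value with the highest cross-validated F1.
candidates = [50, 100, 200, 400]
best_n = max(candidates, key=lambda n: cv_f1(X_train, y_train, n))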