Assignment II
Prediction of Credit Card Defaulters
Hive:
I originally dropped the ID column in Hive, but my VM crashed and I lost all the
screenshots for that step.
Understand and analyze the Dataset
First we load the data.
We then remove the ID column since it is not required.
We look at the schema.
We look at summary statistics of the numerical columns.
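As a rough illustration of what such a summary contains, here is a minimal
plain-Python sketch (not the Spark call used in the assignment). The function
name describe and the toy values are assumptions made up for this example.

```python
import statistics

def describe(values):
    """Return basic summary statistics for one numerical column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

# Toy column (made-up values, not the assignment's data).
toy_column = [1, 2, 3, 4]
print(describe(toy_column))
```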
We then look at the distribution of data of different features.
Next we see the distribution of the target variable.
As can be seen, the target classes are imbalanced.
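The class balance can be checked by counting each label and its share of the
rows. The sketch below is plain Python rather than Spark, and the name
class_distribution and the toy labels are assumptions for illustration only.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: (count, fraction_of_rows)} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (n, n / total) for label, n in counts.items()}

# Toy labels (made up for illustration, not the assignment's data).
toy_labels = [0, 0, 0, 1, 0, 0, 1, 0]
for label, (n, fraction) in sorted(class_distribution(toy_labels).items()):
    print(label, n, fraction)
```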
Next we check if there are any null values.
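A null check amounts to counting missing values per column. The following is a
minimal plain-Python sketch under the assumption that rows are dictionaries and
missing values are None; count_nulls and the toy rows are hypothetical.

```python
def count_nulls(rows):
    """Count missing (None) values per column in a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            # A bool adds as 0 or 1, so this increments only for None.
            counts[column] = counts.get(column, 0) + (value is None)
    return counts

# Toy rows (made-up values, not the assignment's data).
toy_rows = [
    {"limit_bal": 20000, "age": None},
    {"limit_bal": None, "age": 30},
    {"limit_bal": 50000, "age": 31},
]
print(count_nulls(toy_rows))
```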
Next we find the correlation between different features.
We can see that the bill_amt columns are highly correlated with one another.
Logistic regression assumes little multicollinearity among the features, so we
remove bill_amt2-bill_amt6.
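To make the correlation check concrete, here is a small plain-Python sketch of
the Pearson correlation and of listing column pairs above a cutoff. It is not
the Spark routine used in the assignment; the 0.9 threshold, the function names
and the toy columns are all assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs

# Toy columns (made-up values, not the assignment's data).
toy_columns = {
    "bill_amt1": [1.0, 2.0, 3.0, 4.0],
    "bill_amt2": [1.1, 2.0, 3.2, 3.9],
    "age": [30.0, 25.0, 41.0, 28.0],
}
for a, b, r in highly_correlated_pairs(toy_columns):
    print(a, b, round(r, 3))
```

Keeping only one column from each highly correlated group is the idea behind
dropping the later bill_amt columns.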
We then change the target variable from 0/1 to No/Yes.
We transform the pay columns because the one-hot encoder requires its input to
be category indices starting from 0.
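One simple way to obtain 0-based indices is to subtract the smallest code from
every value, and the label change is a plain lookup. The sketch below assumes
that approach; the assignment may have used a different transformation, and the
names to_zero_based and LABEL_NAMES and the toy values are hypothetical.

```python
# Mapping for the target variable, as described in the text (0/1 to No/Yes).
LABEL_NAMES = {0: "No", 1: "Yes"}

def to_zero_based(values):
    """Shift integer category codes so that the smallest code becomes 0."""
    offset = min(values)
    return [v - offset for v in values]

# Toy codes (made up for illustration, not the assignment's data).
toy_pay = [-2, -1, 0, 2]
print(to_zero_based(toy_pay))
print([LABEL_NAMES[v] for v in [0, 1, 1, 0]])
```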
Determine the features.
We ignore bill_amt2-bill_amt6 as stated above.
We first transform the categorical columns to a one-hot representation.
Then we assemble all the required features into a single vector so that they
can be fed as input to the logistic regression model.
We also scale the data to zero mean and unit variance.
We do all this by creating a pipeline of transformations, fitting it to the
data, and then passing the features through it.
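The three steps (one-hot encoding, assembling one feature vector per row, and
scaling) can be sketched in plain Python as below. This is an illustration of
the idea, not the Spark pipeline itself: the function names, the column names
in the toy rows and the use of the population standard deviation are
assumptions, and a library scaler or encoder may differ in such details.

```python
import math

def one_hot(index, size):
    """Return a vector of length `size` with a 1.0 at position `index`."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector

def assemble(row, categorical, numeric):
    """Build one feature vector from a dict row.

    categorical: {column_name: number_of_categories}, values are 0-based indices
    numeric: list of numeric column names
    """
    features = []
    for column, size in categorical.items():
        features.extend(one_hot(row[column], size))
    features.extend(float(row[column]) for column in numeric)
    return features

def standardize(vectors):
    """Scale every dimension to zero mean and unit (population) variance."""
    n = len(vectors)
    dims = len(vectors[0])
    means = [sum(v[j] for v in vectors) / n for j in range(dims)]
    stds = [
        math.sqrt(sum((v[j] - means[j]) ** 2 for v in vectors) / n)
        for j in range(dims)
    ]
    # Constant dimensions (std of 0) are mapped to 0.0 to avoid dividing by zero.
    return [
        [(v[j] - means[j]) / stds[j] if stds[j] else 0.0 for j in range(dims)]
        for v in vectors
    ]

# Toy rows (made-up values, not the assignment's data).
toy_rows = [
    {"sex": 0, "limit_bal": 100},
    {"sex": 1, "limit_bal": 300},
]
assembled = [assemble(r, {"sex": 2}, ["limit_bal"]) for r in toy_rows]
print(standardize(assembled))
```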
Divide dataset
We split the dataset into training and test sets in a 60:40 ratio.
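A seeded shuffle followed by a cut at 60% gives a reproducible split. This is a
plain-Python sketch of the idea rather than the Spark call used in the
assignment; train_test_split and the seed value 42 are assumptions.

```python
import random

def train_test_split(rows, train_fraction=0.6, seed=42):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy rows (integers standing in for records).
train, test = train_test_split(range(10))
print(len(train), len(test))
```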
Determine a Model and its measurement function
We define a logistic regression model and train the model on the train dataset.
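To show what training a logistic regression model involves, here is a minimal
sketch that fits the weights with batch gradient descent on the log loss. It
is not the optimizer a library implementation would use, and the function
names, learning rate, epoch count and toy data are assumptions for
illustration.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Predicted probability of the positive class for one feature vector."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def train_logistic_regression(X, y, learning_rate=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n, dims = len(X), len(X[0])
    weights = [0.0] * dims
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dims
        grad_b = 0.0
        for features, label in zip(X, y):
            error = predict_proba(weights, bias, features) - label
            for j in range(dims):
                grad_w[j] += error * features[j]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

# Toy training data (made up for illustration, not the assignment's data).
X_train = [[-2.0], [-1.0], [1.0], [2.0]]
y_train = [0, 0, 1, 1]
weights, bias = train_logistic_regression(X_train, y_train)
print(predict_proba(weights, bias, [2.0]) > 0.5)
```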
Verify the Model accuracy.
We look at the area under the ROC curve, the accuracy and the F1-score of our
model.
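The three metrics can be computed by hand as in the sketch below. It is plain
Python, and a library evaluator may define F1 differently (for example by
averaging over classes), so the numbers need not match the ones reported
above. The function names and toy scores are assumptions for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores, positive=1):
    """Area under the ROC curve: the probability that a random positive is
    scored above a random negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy labels and scores (made up for illustration).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(roc_auc(y_true, scores), accuracy(y_true, y_pred), f1_score(y_true, y_pred))
```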
Use the Spark web UI to determine which tasks take most of your
execution time.
The fit command took the most time. It spawned 106 jobs with 126 stages. The longest
stage took 7 seconds, as shown above.