Assignment II

Spark

Uploaded by

HPot PotTech

Assignment II

Prediction of Credit Card Defaulters

Hive:
I initially dropped the ID column in Hive, but my VM crashed and I lost all the
screenshots for that step.

Understand and analyze the Dataset

First we load the data.

We then remove the ID column since it is not required.

We look at the schema.

We look at the summary statistics of the numerical columns.

We then look at the distribution of the data for the different features.

Next we look at the distribution of the target variable. As can be seen, the
dataset is skewed (the two target classes are imbalanced).
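
As a rough illustration of what a skewed target looks like, the sketch below counts each class and reports its share of the rows. It is plain Python on a made-up `labels` list (hypothetical values, not taken from the dataset); in Spark the same counts would typically come from grouping by the target column.

```python
from collections import Counter

def class_shares(labels):
    """Return {class: fraction of rows} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

# Hypothetical labels for illustration only: 0 = no default, 1 = default.
labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
shares = class_shares(labels)
print(shares)
```

When one class holds most of the rows, accuracy alone can look good even for a model that rarely predicts the minority class, which is one reason to also report F1 and area under ROC.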

Next we check if there are any null values.
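
A minimal sketch of a null check, assuming each row is a Python dict and a missing value is represented by `None`. The `rows` data and column values are hypothetical; this only shows the idea of counting missing values per column.

```python
def null_counts(rows):
    """Count how many rows have None in each column."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts[col] = counts.get(col, 0) + (1 if value is None else 0)
    return counts

# Hypothetical rows for illustration only.
rows = [
    {"limit_bal": 20000, "age": 24},
    {"limit_bal": None, "age": 26},
    {"limit_bal": 90000, "age": 34},
]
missing = null_counts(rows)
print(missing)
```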

Next we find the correlation between different features.

We can see that the bill_amt columns are highly correlated with one another.
Logistic regression assumes little multicollinearity among the features, so we
remove bill_amt2-bill_amt6.
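
The idea of flagging highly correlated columns can be sketched in plain Python as follows. The column values and the 0.9 threshold are assumptions chosen for illustration; the report does not state which threshold, if any, was used.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs

# Hypothetical values for illustration only.
columns = {
    "bill_amt1": [100.0, 200.0, 300.0, 400.0, 500.0],
    "bill_amt2": [110.0, 190.0, 310.0, 390.0, 520.0],
    "limit_bal": [500.0, 100.0, 400.0, 200.0, 300.0],
}
pairs = highly_correlated_pairs(columns)
print(pairs)
```

For each flagged pair, keeping one column and dropping the other removes the redundancy while retaining most of the information.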

We then change the target variable from 0/1 to No/Yes.

We transform the pay columns because the one-hot encoder needs its inputs to be
category indices starting from 0.
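
One simple way to obtain indices starting from 0 is to subtract the column minimum, sketched below. This assumes the raw pay codes are integers that may be negative; the sample values are hypothetical, and the report does not say which transformation was actually applied.

```python
def shift_to_zero_based(values):
    """Shift integer codes so the smallest becomes 0; return (codes, offset)."""
    offset = min(values)
    return [v - offset for v in values], offset

# Hypothetical pay codes for illustration only.
pay_codes = [-2, -1, 0, 2, -1, 1]
indices, offset = shift_to_zero_based(pay_codes)
print(indices, offset)
```

Keeping the offset makes it possible to map an index back to the original code later.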

Determine the features.

We ignore bill_amt2-bill_amt6 as stated above.

We first transform the categorical columns to a one-hot representation.
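
To make the one-hot representation concrete, here is a small plain-Python sketch that turns a category index into a dense vector. The `drop_last` option is an illustrative parameter of this sketch; it mimics the common convention of encoding the last category as all zeros so that the columns are not linearly dependent.

```python
def one_hot(index, num_categories, drop_last=False):
    """Encode a zero-based category index as a dense one-hot list."""
    if not 0 <= index < num_categories:
        raise ValueError("index must be in [0, num_categories)")
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0  # the dropped last category stays all zeros
    return vec

print(one_hot(2, 4))
print(one_hot(3, 4, drop_last=True))
```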

Then we vectorize all the required features so that they can be fed as input to
the logistic regression model.

We also scale the data to zero mean and unit variance.
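
Scaling to zero mean and unit variance means subtracting the column mean and dividing by the column standard deviation. The sketch below does this for a single column in plain Python; it uses the sample standard deviation (dividing by n - 1), which is an assumption of this sketch, and the input values are hypothetical.

```python
from statistics import mean, stdev

def standardize(values):
    """Return values rescaled to zero mean and unit sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]

scaled = standardize([2.0, 4.0, 6.0, 8.0])
print(scaled)
```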

We do all this by creating a pipeline of transformations and then fitting the
pipeline to the features.

Divide dataset

We split the dataset into train:test in 60:40 ratio.
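
A seeded 60:40 split can be sketched in plain Python as below. This version shuffles and cuts, so the sizes are exact; a distributed random split usually assigns each row independently and therefore only approximates the requested ratio. The seed value is an arbitrary choice for reproducibility.

```python
import random

def train_test_split(rows, train_fraction=0.6, seed=42):
    """Shuffle rows reproducibly and cut them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(range(10))
print(len(train), len(test))
```

Fixing the seed keeps the split identical across runs, which makes the reported metrics comparable.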

Determine a Model and its measurement function

We define a logistic regression model and train the model on the train dataset.
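
To show what training a logistic regression model involves, here is a minimal sketch using batch gradient descent on the log loss. It is not the optimizer Spark uses and not the report's code; the learning rate, epoch count and toy data are assumptions chosen only so the example runs quickly.

```python
from math import exp

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, x):
    """Predicted probability of the positive class for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            for j, value in enumerate(xi):
                grad_w[j] += err * value
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Hypothetical one-feature toy data for illustration only.
X = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = fit_logistic(X, y)
preds = [1 if predict_proba(weights, bias, x) >= 0.5 else 0 for x in X]
print(weights, bias, preds)
```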

Verify the Model accuracy.

We look at the area under ROC, accuracy and F1-score of our model.
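
The three metrics can be computed by hand as sketched below. The labels and scores are hypothetical, the 0.5 decision threshold is an assumption, and the F1 here is the binary F1 for the positive class, which may differ from a class-averaged F1 reported by an evaluator.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores, positive=1):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), roc_auc(y_true, scores))
```

Area under ROC is computed from the scores rather than the thresholded predictions, so it does not depend on the choice of threshold.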

Use the Spark web UI to determine which tasks take most of your
execution time.

The fit call took the most time: it spawned 106 jobs with 126 stages. The longest
stage took 7 seconds, as shown above.
