Recsify Technologies Assignment

Problem statement

Submitted by: Purvesh Patil (9422324279)


Based on the given financial data, create an ML model to predict whether a client would be
high risk or low risk if we were to provide them a loan. We need to predict the column
Risk_Flag, which contains 1 if the client is high risk and 0 otherwise.

Here's a detailed explanation of each part of the code, walking through the main steps of the
machine-learning workflow: data exploration and visualization, preprocessing, model building,
performance evaluation, and the main deciding factors associated with risk.

1. Data Loading and Exploration


• Load the Data: The dataset is loaded from a JSON file into a pandas DataFrame.
• Data Exploration:
  • data.head() displays the first five rows.
  • data.info() reports data types and missing-value counts.
  • data.describe() gives a statistical summary of numerical features.

# Load the data
data = pd.read_json('loan_approval_dataset.json')

# Data Exploration
print("First five rows of the dataset:")
print(data.head())

print("\nData types and missing values:")
data.info()  # info() prints directly and returns None, so no print() wrapper is needed

print("\nStatistical summary:")
print(data.describe())
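data.info() reports non-null counts, but a per-column count of missing values is often quicker to scan. A minimal sketch of the idea, using a small synthetic frame in place of the real dataset (the column names here are illustrative):

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the loan dataset (illustration only)
df = pd.DataFrame({'Income': [50000, np.nan, 72000],
                   'Age': [25, 40, np.nan],
                   'Risk_Flag': [0, 1, 0]})

# Count missing values per column
missing = df.isnull().sum()
print(missing)
```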

2. Data Visualization
• Target Variable Distribution: A count plot is created to visualize the distribution of the
Risk_Flag variable.
• Feature Distribution: Histograms for all numerical features are plotted to understand their
distributions.
# Data Visualization
plt.figure(figsize=(10, 6))
sns.countplot(x='Risk_Flag', data=data)
plt.title('Distribution of Risk Flag')
plt.savefig('risk_flag_distribution.png')
plt.show()

# Visualize the distribution of numerical features
for column in data.select_dtypes(include=['int64', 'float64']).columns:
    if column != 'Id':
        plt.figure(figsize=(10, 6))
        sns.histplot(data[column], kde=True)
        plt.title(f'Distribution of {column}')
        plt.savefig(f'distribution_{column}.png')
        plt.show()
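The count plot shows the class balance visually; value_counts(normalize=True) gives the same information numerically, which is worth recording because loan-risk datasets are typically imbalanced. A small sketch with made-up labels:

```python
import pandas as pd

# Made-up Risk_Flag values (illustration; the real column comes from the JSON file)
risk = pd.Series([0, 0, 0, 0, 1])

# Fraction of each class; the values sum to 1
proportions = risk.value_counts(normalize=True)
print(proportions)
```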

3. Data Preprocessing
• Encoding Categorical Variables: Categorical features are converted to numeric using
LabelEncoder.

# Convert categorical variables to numeric
label_encoders = {}
for column in data.select_dtypes(include=['object']).columns:
    label_encoders[column] = LabelEncoder()
    data[column] = label_encoders[column].fit_transform(data[column])
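One caveat: LabelEncoder maps categories to arbitrary integers, which implicitly imposes an order the categories may not have. Tree models tolerate this reasonably well, but one-hot encoding via pd.get_dummies is a common alternative worth considering. A small sketch with an illustrative column name:

```python
import pandas as pd

# Synthetic categorical column (illustration only)
df = pd.DataFrame({'House_Ownership': ['rented', 'owned', 'rented']})

# One-hot encoding avoids implying an order between categories
encoded = pd.get_dummies(df, columns=['House_Ownership'])
print(encoded.columns.tolist())
```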

4. Correlation Heatmap
• Heatmap: A correlation heatmap is plotted to show the correlations between features. This
helps identify multicollinearity and the relationship between features and the target variable.
# Correlation Heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.savefig('correlation_heatmap.png')
plt.show()
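Beyond the full heatmap, sorting each feature's correlation with Risk_Flag by absolute value gives a quick ranking of linear relationships with the target. A sketch on a tiny synthetic frame (column names illustrative):

```python
import pandas as pd

# Tiny synthetic frame (illustration only)
df = pd.DataFrame({'Income': [1, 2, 3, 4],
                   'Age': [4, 3, 2, 1],
                   'Risk_Flag': [0, 0, 1, 1]})

# Absolute correlation of each feature with the target, strongest first
target_corr = df.corr()['Risk_Flag'].drop('Risk_Flag').abs().sort_values(ascending=False)
print(target_corr)
```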

5. Feature Engineering and Splitting Data


• Feature Engineering: The target variable Risk_Flag is separated from the feature set.
The Id column is also dropped as it doesn't provide predictive value.
• Train-Test Split: The data is split into training and testing sets (70-30 split).
• Standardization: Features are standardized to have zero mean and unit variance using
StandardScaler.

# Feature Engineering
X = data.drop(columns=['Id', 'Risk_Flag'])
y = data['Risk_Flag']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Standardizing the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
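One refinement worth considering: if Risk_Flag is imbalanced, passing stratify=y to train_test_split keeps the class ratio identical in the training and test sets. A sketch on synthetic data (X_demo and y_demo are stand-ins for the real features and target):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 10% positives (illustration only)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 10 + [0] * 90)

# stratify preserves the 10% positive rate in both splits
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      random_state=42, stratify=y_demo)
print(ytr.mean(), yte.mean())
```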

6. Hyperparameter Tuning and Model Building


• Hyperparameter Tuning: A grid search with cross-validation (GridSearchCV) is used to
find the best hyperparameters for the RandomForestClassifier.
• Model Training: The best model from the grid search is used to fit the training data.
# Hyperparameter tuning using GridSearchCV
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                           param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Model Building
# (refit=True is the default, so best_estimator_ is already trained on the full training set)
best_model = grid_search.best_estimator_
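After the search, grid_search.best_params_ and grid_search.best_score_ report the winning hyperparameter combination and its mean cross-validated accuracy. A quick sketch of inspecting a finished search, on a tiny synthetic problem with a deliberately small grid:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Tiny synthetic problem (illustration only)
rng = np.random.RandomState(42)
X_demo = rng.rand(60, 3)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      {'n_estimators': [10, 20]}, cv=3)
search.fit(X_demo, y_demo)

# Winning parameters and their mean cross-validated score
print(search.best_params_, round(search.best_score_, 3))
```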

7. Model Evaluation
• Predictions: The model makes predictions on the test set.
• Evaluation Metrics: The classification report, confusion matrix, and accuracy score are
printed to evaluate the model's performance.
# Predictions and Evaluation
y_pred = best_model.predict(X_test)
y_pred_prob = best_model.predict_proba(X_test)[:, 1]

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nAccuracy Score:")
print(accuracy_score(y_test, y_pred))
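A caveat on accuracy: with an imbalanced Risk_Flag, a model that always predicts the majority class can score high accuracy while being useless. Balanced accuracy and F1 expose this; a deliberately extreme illustration:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Extreme illustration: 10% positives, model never flags high risk
y_true = [0] * 9 + [1]
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)            # looks good: 0.9
bal = balanced_accuracy_score(y_true, y_pred)   # exposes the failure: 0.5
f1 = f1_score(y_true, y_pred, zero_division=0)  # no true positives: 0.0
print(acc, bal, f1)
```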

8. ROC Curve
• ROC Curve: The ROC curve and AUC score are plotted to evaluate the model's
performance in distinguishing between the classes.

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
roc_auc = roc_auc_score(y_test, y_pred_prob)
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.savefig('roc_curve.png')
plt.show()
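Under class imbalance, the precision-recall curve is often more informative than ROC; average_precision_score summarizes it in a single number, analogous to AUC. A sketch with illustrative scores:

```python
from sklearn.metrics import average_precision_score

# Illustrative labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# Average precision summarizes the precision-recall curve
ap = average_precision_score(y_true, y_scores)
print(round(ap, 3))
```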

9. Feature Importance
• Feature Importance: The importance of each feature in the random forest model is
plotted to understand which features are the main deciding factors associated with risk.

# Feature Importance
feature_importances = best_model.feature_importances_
features = X.columns
feature_importance_df = pd.DataFrame({'Feature': features,
                                      'Importance': feature_importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance',
                                                          ascending=False)

plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.title('Feature Importance')
plt.savefig('feature_importance.png')
plt.show()
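Impurity-based feature_importances_ can be biased toward high-cardinality features; sklearn.inspection.permutation_importance is a complementary check that measures how much the score drops when each feature is shuffled. A sketch on a synthetic problem where only the first feature matters:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic problem: only the first column is informative (illustration only)
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

model = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)

# Score drop when each feature is shuffled, averaged over repeats
result = permutation_importance(model, X_demo, y_demo, n_repeats=5,
                                random_state=0)
print(result.importances_mean)
```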

Complete Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
                             accuracy_score, roc_curve, roc_auc_score)
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
from reportlab.lib.utils import ImageReader

# Load the data
data = pd.read_json('loan_approval_dataset.json')

# Data Exploration
print("First five rows of the dataset:")
print(data.head())

print("\nData types and missing values:")
data.info()  # info() prints directly and returns None

print("\nStatistical summary:")
print(data.describe())

# Data Visualization
plt.figure(figsize=(10, 6))
sns.countplot(x='Risk_Flag', data=data)
plt.title('Distribution of Risk Flag')
plt.savefig('risk_flag_distribution.png')
plt.show()

# Visualize the distribution of numerical features
for column in data.select_dtypes(include=['int64', 'float64']).columns:
    if column != 'Id':
        plt.figure(figsize=(10, 6))
        sns.histplot(data[column], kde=True)
        plt.title(f'Distribution of {column}')
        plt.savefig(f'distribution_{column}.png')
        plt.show()

# Convert categorical variables to numeric
label_encoders = {}
for column in data.select_dtypes(include=['object']).columns:
    label_encoders[column] = LabelEncoder()
    data[column] = label_encoders[column].fit_transform(data[column])

# Correlation Heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.savefig('correlation_heatmap.png')
plt.show()

# Feature Engineering
X = data.drop(columns=['Id', 'Risk_Flag'])
y = data['Risk_Flag']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Standardizing the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Hyperparameter tuning using GridSearchCV
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                           param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Model Building
# (refit=True is the default, so best_estimator_ is already trained on the full training set)
best_model = grid_search.best_estimator_

# Predictions and Evaluation
y_pred = best_model.predict(X_test)
y_pred_prob = best_model.predict_proba(X_test)[:, 1]

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nAccuracy Score:")
print(accuracy_score(y_test, y_pred))

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
roc_auc = roc_auc_score(y_test, y_pred_prob)
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.savefig('roc_curve.png')
plt.show()

# Feature Importance
feature_importances = best_model.feature_importances_
features = X.columns
feature_importance_df = pd.DataFrame({'Feature': features,
                                      'Importance': feature_importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance',
                                                          ascending=False)

plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.title('Feature Importance')
plt.savefig('feature_importance.png')
plt.show()
