Programming For Data Analysis Assignment

The document outlines a group assignment for a programming course focused on data analysis, specifically investigating the factors affecting credit scores using R programming. It includes sections on data preparation, analysis of demographic and financial behavior variables, and the implementation of machine learning models for credit classification. The assignment is structured with a clear timeline, objectives, and methodologies for analyzing customer credit behavior to improve financial institutions' risk assessment processes.

Uploaded by

budhah282

GROUP ASSIGNMENT

TECHNOLOGY PARK MALAYSIA

CT127-3-2-PFDA

PROGRAMMING FOR DATA ANALYSIS

NPT2F2409IT

Cover Page

HAND OUT DATE: 22nd December 2024

HAND IN DATE: 25th February 2025

Submitted By:

Members APU number

Hemraj Budha NP069673

Nischal Sharma NP069713

Sabin Shrestha NP069746

Sujal Shrestha NP069768


Table of Contents

Cover Page

Table of Contents

List of Figures

1. Introduction

1.1 Data Description

1.2 Assumptions

1.3 Hypothesis & Objectives

2. Data Preparation

2.1 Data Import

2.2 Data Cleaning

2.3 Data Validation

3. Data Analysis

Objective 1: Relationships between demographic factors and credit scores. (Nischal Sharma)

Analysis 1.1: Age Distribution by Credit Score

Analysis 1.2: Occupation and Credit Score Distribution

Objective 2: Analyze financial behavior and its impact on credit classification. (Sabin Shrestha)

Analysis 2.1: Outstanding Debt and Credit Score

Analysis 2.2: Correlation of Financial Variables

Objective 3: Investigate payment history as a predictor of credit classification. (Hemraj Budha)

Analysis 3.1: Payment Frequency and Credit Score

Analysis 3.2: Monthly Balance and Credit Classification

Analysis 3.3: Credit Utilization Ratio and Credit Score

Objective 4: Relationship between Bank Accounts and Credit Card (Sujal Shrestha)

Extra feature: Implementing a machine learning model to classify credit scores. (Sujal Shrestha)

4. Conclusion

4.1 Recommendations

4.2 Limitations and Future Directions

4.3 Word Count

5. References

6. Appendix

6.1 Workload Matrix

List of Figures
Figure 1. 1: Source code of Importing data
Figure 1. 2: Displaying a few data using head ()
Figure 1. 3: Checking the missing values of a dataset
Figure 1. 4: Number of missing values
Figure 1. 5: Source code for removing duplicate records
Figure 1. 6: Source code for handling the missing values
Figure 1. 7: Descriptive Statistics for Age and Credit Score
Figure 1. 8: Descriptive analysis of age distribution by credit score
Figure 1. 9: Source code for age distribution by credit score
Figure 1. 10: Histogram of age by credit score

Figure 2. 1: Anova test
Figure 2. 2: Source code for occupation and credit score distribution
Figure 2. 3: Bar charts of Occupation vs credit score distribution
Figure 2. 4: Chi-square Test
Figure 2. 5: Source code for outstanding debt vs credit score
Figure 2. 6: Histogram of outstanding vs credit score
Figure 2. 7: Anova Test Hypothesis
Figure 2. 8: Source code for visualizing Financial Variables
Figure 2. 9: Correlation of Financial Variables
Figure 2. 10: Correlation Hypothesis

Figure 3. 1: Descriptive Analysis for payment behavior vs credit score
Figure 3. 2: Source code for payment vs credit score
Figure 3. 3: Bar charts for payment behavior vs Credit Score classification
Figure 3. 4: Chi-Square test
Figure 3. 5: Descriptive Analysis for the Monthly Balance
Figure 3. 6: Source code for visualizing money balance and credit classification
Figure 3. 7: Histogram of monthly balance vs credit score
Figure 3. 8: ANOVA test of hypothesis for Monthly Balance
Figure 3. 9: Descriptive Analysis for Credit Utilization Ratio
Figure 3. 10: Source code for visualizing credit utilization ratio and credit scores

Figure 4. 1: Box plot of credit utilization vs credit score classification
Figure 4. 2: Kruskal-Wallis Test
Figure 4. 3: Descriptive Analysis for bank account vs credit card
Figure 4. 4: Source code for Bank Accounts and Credit Card
Figure 4. 5: Scatter plot for Bank Accounts and Credit card
Figure 4. 6: Correlation analysis for Bank Accounts and Credit score

1. Introduction

Financial institutions evaluate customer loan risk through credit scoring, which uses banking
factors as assessment criteria. Credit scores help banks decide which customers qualify for loans,
how much they may borrow, and at what interest rate. Modern data analytics lets banks process
large volumes of customer data and uncover relationships that improve their credit risk models.
By studying customer behavior, these data-driven scoring tools support better loan approval and
risk management decisions (Software, 2024).

This analysis uses R to process a bank's customer dataset and investigate which financial
attributes affect credit scores. It aims to identify the key behavioral variables that influence
credit scores. Statistical analysis, correlation testing, and machine learning algorithms are
combined to produce recommendations that help banks improve their loan assessment process and
reduce financial risk (Regions Bank, 2008).

1.1 Data Description

The dataset contains bank customer attributes grouped into demographic information, financial
behavior data, and credit score categories. It requires pre-processing so that the data are
accurate, consistent, and complete for the subsequent analysis.

1.1.1 Demographic Attributes

• Age – Indicates the customer's life stage, which is associated with the length of credit
history and financial security.

• Occupation – Reflects the consistency of a customer's earnings and the capacity to repay debt.

• Income – Income level, used to assess personal financial strength and ability to pay.

1.1.2 Financial Behavior Variables

• Credit Utilization Ratio – The share of allotted credit that a customer actually uses. A high
ratio may signal financial difficulty.

• Outstanding Debt – The amount of debt that remains unpaid, which can hinder loan approval and
lower credit scores.

• Monthly Balance – Shows spending habits and financial management over time.

• Payment History – Tracks whether payments are made on time, a critical input to credit
scoring.

• Default Records – Flags customers who have defaulted on loans, which lowers their assessed
creditworthiness.

1.1.3 Credit Score Classification

The credit score classification is the main variable of interest. It assigns each customer to one
of the following categories:

• Excellent (750+) – Low-risk customers with strong financial discipline.

• Good (650-749) – Moderate-risk customers with stable credit behavior.

• Fair (550-649) – Higher-risk customers with some credit issues.

• Poor (<550) – High-risk customers with a history of defaults or missed payments.
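
As a minimal sketch, the score bands listed above can be reproduced with base R's cut(). The
numeric scores below are hypothetical, since the report does not show raw score values, and the
variable names are illustrative only.

```r
# Hypothetical numeric scores (not taken from the report's dataset).
scores <- c(780, 700, 600, 520, 750, 650, 550)

# right = FALSE makes each interval closed on the left:
# [-Inf, 550), [550, 650), [650, 750), [750, Inf)
bands <- cut(
  scores,
  breaks = c(-Inf, 550, 650, 750, Inf),
  labels = c("Poor", "Fair", "Good", "Excellent"),
  right = FALSE
)

print(data.frame(score = scores, band = bands))
```

With right = FALSE, a boundary score such as 750 falls into the higher band, which matches the
"750+" and "650-749" wording of the list.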

1.1.4 Pre-processing Considerations

• Missing Values – Handled with appropriate data imputation methods.

• Data Normalization – Standardizing numerical attributes for consistency in analysis.

• Duplicate Removal – Duplicate records are removed so that they do not influence the results.

• Outlier Detection – Extreme values are identified and handled so that they do not skew the
results.

The dataset gives financial institutions a basis for analyzing customer credit behavior and for
identifying the variables that improve predictive models and credit risk assessment.
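
The normalization and outlier steps can be sketched as follows. This is an illustration under
stated assumptions, not the report's code: the debt values are invented, min-max scaling stands in
for whichever normalization was actually applied, and the 1.5 x IQR rule is one common outlier
criterion.

```r
# Hypothetical outstanding-debt values with one extreme entry.
debt <- c(500, 800, 1200, 950, 700, 25000)

# Min-max normalization rescales the values to the range [0, 1].
debt_scaled <- (debt - min(debt)) / (max(debt) - min(debt))

# Flag values that lie more than 1.5 * IQR outside the quartiles.
q <- unname(quantile(debt, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- debt < q[1] - fence | debt > q[2] + fence

print(round(debt_scaled, 3))
print(debt[is_outlier])
```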

1.2 Assumptions

The analysis rests on the following assumptions:

• The dataset is accurate and correctly represents customer information.

• Missing values are handled with an appropriate imputation strategy.

• Customer records are unique, so duplicates do not distort the results.

• Each independent variable is treated as having a separate influence on credit scores unless
the evidence indicates otherwise.

• Past credit behavior is a consistent indicator of a customer's future credit standing.

• Credit scores are approximately normally distributed, which permits the statistical tests used.

1.3 Hypothesis & Objectives

Hypothesis

This study tests several hypotheses about the factors that shape credit scores:

• Higher credit utilization is associated with lower credit scores. Customers who use most of
their available credit are rated as higher risk.

• Customers with substantial outstanding debt tend to have poor credit scores. Financial
instability caused by high outstanding debt places them in lower credit score bands.

• Occupation and income influence credit scores. Stable employment and income are associated
with better scores when customers also pay their bills on time.

• Customers with short or undocumented credit histories usually have lower credit scores than
customers with longer histories.

• Consistent on-time payment over time is associated with improved credit scores.

Testing these hypotheses is intended to validate core financial patterns and to support improved
credit risk assessment models.

Objectives

This research investigates customer credit behavior by analyzing the variables that affect credit
score assessment. The specific objectives are:

• Examine how credit utilization affects credit scores, including whether increased utilization
is associated with downgrades among customers who otherwise have excellent scores.

• Evaluate how outstanding debt affects credit scores by studying the relationship between total
outstanding debt and credit score.

• Explore how demographic characteristics (age, occupation, and income) relate to credit score
classification.

• Build a prediction system that classifies customer credit scores from customer attributes using
statistical models and machine learning.

• Derive recommendations that help financial institutions improve their loan evaluation process
and reduce financial risk.

2. Data Preparation

2.1 Data Import

The first step in the analysis is to import the data into the R environment. The dataset, stored
in CSV format, is read with read.csv() and saved in a data frame.

Figure 1. 1: Source code of Importing data.

The R code sets up the credit classification project by loading the required libraries: ggplot2
for visualization, dplyr for data manipulation, caret for machine learning tools, randomForest
for model construction, and corrplot for correlation visualization. It then reads “5. Credit
score classification data.csv” into a data frame named data and displays its structure.
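
The source code in Figure 1. 1 is available only as an image, so the following is a hedged sketch
of the import step rather than the authors' exact script. It writes a small stand-in CSV to a
temporary file (the three rows are invented) so that it runs without the real dataset. In the
report's script, the libraries named above would be loaded first with library().

```r
# Write a tiny stand-in CSV so the sketch runs without the real file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Age,Annual_income,Credit_Score",
  "23,19114.12,Good",
  "28,34847.84,Standard",
  "34,,Poor"
), csv_path)

# Read the CSV into a data frame named data and inspect it.
data <- read.csv(csv_path, stringsAsFactors = FALSE)
str(data)          # column names and types
print(head(data))  # first rows
```

An empty field in a numeric column, as in the last row, is read as NA.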

Output

Figure 1. 2: Displaying a few data using head ()

The first few rows give an initial view of the data structure and its contents. The dataset
contains credit score classification data and is now ready for exploration and model development.

2.2 Data Cleaning

2.2.1 Check missing values

Figure 1. 3: Checking the missing values of a dataset

The script counts the missing values in the data object, which gives a quick view of how much
data is missing across the dataset.
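
A minimal sketch of such a check, assuming a small made-up data frame in place of the real one,
could look like this. is.na() marks each missing cell, sum() gives the overall total, and
colSums() breaks it down per column.

```r
# Hypothetical stand-in for the real data frame.
data <- data.frame(
  Age = c(23, NA, 34, 41),
  Annual_income = c(19114, 34847, NA, NA)
)

total_missing <- sum(is.na(data))          # all NA cells in the data frame
missing_by_column <- colSums(is.na(data))  # NA cells per column

print(total_missing)
print(missing_by_column)
```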

Figure 1. 4: Number of missing values

The output shows that the dataset contains 25997 missing values, which R represents as NA. These
values must be treated before analysis and model development begin.

2.2.2 Removing duplicate data

Figure 1. 5: Source code for removing duplicate records

The code removes duplicate rows from data using distinct() from the dplyr package. distinct()
compares rows across all columns and keeps one copy of each, so every remaining row is unique.
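
The report's script uses dplyr's distinct(). As a dependency-free sketch of the same idea, base
R's duplicated() can drop rows that repeat an earlier row in every column. The data frame and the
Customer_ID column are hypothetical.

```r
# Hypothetical data frame in which the third row repeats the second.
data <- data.frame(
  Customer_ID = c("C1", "C2", "C2", "C3"),
  Age = c(23, 28, 28, 34)
)

# duplicated() marks rows that are identical to an earlier row in all columns.
deduped <- data[!duplicated(data), ]

print(nrow(data) - nrow(deduped))  # number of rows dropped
print(deduped)
```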

2.2.3 Handling the Missing values

Figure 1. 6: source code for handling the missing values

The code handles missing entries in the “Age” and “Annual_income” variables. It replaces every
NA with the median of the respective column, which preserves data consistency by using a measure
of central tendency.
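
Median imputation of this kind can be sketched as follows. The values are invented, and the loop
is one possible way to write it, not necessarily how Figure 1. 6 does.

```r
# Hypothetical data with one missing value per column.
data <- data.frame(
  Age = c(23, NA, 34, 41),
  Annual_income = c(19000, 35000, NA, 52000)
)

# Replace each NA with the median of the observed values in that column.
for (col in c("Age", "Annual_income")) {
  col_median <- median(data[[col]], na.rm = TRUE)
  data[[col]][is.na(data[[col]])] <- col_median
}

print(data)
```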

2.3 Data Validation



3. Data Analysis

Objective 1: Relationships between demographic factors and credit scores. (Nischal Sharma)

Analysis 1.1: Age Distribution by Credit Score

Descriptive Analysis

Figure 1. 7: Descriptive Statistics for Age and Credit Score

This code calculates and displays the mean, standard deviation, and summary statistics for age
and credit score.

Figure 1. 8: Descriptive analysis of age distribution by credit score.

The output shows the mean, median, and standard deviation of age.
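
A hedged sketch of per-group descriptive statistics is shown below. It assumes a Credit_Score
column with the classes Good, Poor, and Standard and uses six invented records, so the numbers it
prints are not the report's results.

```r
# Hypothetical records.
data <- data.frame(
  Age = c(22, 25, 31, 40, 28, 35),
  Credit_Score = c("Good", "Poor", "Standard", "Good", "Poor", "Standard")
)

# tapply() applies a function to Age within each credit score class.
age_mean <- tapply(data$Age, data$Credit_Score, mean)
age_median <- tapply(data$Age, data$Credit_Score, median)
age_sd <- tapply(data$Age, data$Credit_Score, sd)

print(cbind(mean = age_mean, median = age_median, sd = age_sd))
print(summary(data$Age))  # min, quartiles, median, mean, max for all records
```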

Exploratory Data Analysis

Figure 1. 9: Source code for age distribution by credit score

This R script uses ggplot2 to draw a histogram of age, split by credit score category. Each
category is drawn in its own color, and the plot uses a minimal theme. The code also adds a title
and axis labels to make the chart easier to read.
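
A plot of this kind might be built as in the sketch below. It assumes ggplot2 is installed and
uses randomly generated ages, and the bin width, title, and labels are illustrative choices rather
than the values used in Figure 1. 9.

```r
library(ggplot2)

# Randomly generated stand-in data (not the report's dataset).
set.seed(1)
data <- data.frame(
  Age = sample(18:60, 90, replace = TRUE),
  Credit_Score = rep(c("Good", "Poor", "Standard"), each = 30)
)

# Histogram of Age with one fill color per credit score class.
p <- ggplot(data, aes(x = Age, fill = Credit_Score)) +
  geom_histogram(binwidth = 5, color = "white") +
  theme_minimal() +
  labs(
    title = "Age Distribution by Credit Score",
    x = "Age", y = "Count", fill = "Credit Score"
  )

print(p)
```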

Figure 1. 10: Histogram of age by credit score

The histogram shows how individuals of different ages are distributed across the Good, Poor, and
Standard categories. The vertical axis shows the count and the horizontal axis shows age. In
every credit score category, most records cluster at the low end of the age axis, near zero.
Within this main age group, Standard is the largest category, followed closely by Poor and then
Good. Older individuals are sparsely represented, because the count falls steadily as age
increases along the x-axis (Brown, 2021).

Findings and Interpretations

• The counts show that most records belong to young adults.

• Few older individuals are included, which makes their credit score patterns difficult to
assess.

• Most young adults hold “Standard” or “Poor” scores, which could indicate difficulty
establishing credit or limited financial education.

Hypothesis Statement

The analysis suggests that people under thirty tend to receive Standard and Poor credit scores,
while older groups are less represented in every category, with counts decreasing as age
increases.

Statistical Test: ANOVA Test

Figure 2. 1: Anova test

The ANOVA found no statistically significant difference in age between the credit score groups:
the p-value of 0.269 is above the 0.05 significance level.
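
A one-way ANOVA of this form can be sketched with aov(). The ages below are simulated, so the
p-value the sketch prints will differ from the 0.269 reported above. The 0.05 threshold follows
the convention used elsewhere in the report.

```r
# Simulated ages for three credit score groups (illustrative only).
set.seed(42)
data <- data.frame(
  Age = c(rnorm(30, 33, 8), rnorm(30, 34, 8), rnorm(30, 33, 8)),
  Credit_Score = factor(rep(c("Good", "Poor", "Standard"), each = 30))
)

# One-way ANOVA: does mean Age differ between credit score groups?
fit <- aov(Age ~ Credit_Score, data = data)
print(summary(fit))

# Extract the p-value for the Credit_Score term.
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
if (p_value < 0.05) {
  message("Mean age differs significantly between groups.")
} else {
  message("No significant difference in mean age between groups.")
}
```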

Analysis 1.2: Occupation and Credit Score Distribution

Question: Does occupation influence creditworthiness?

Exploratory Data Analysis

Figure 2. 2: Source code for occupation and credit score distribution

This code uses ggplot2 to draw a bar chart of the Credit_Score distribution across occupations.
geom_bar(position = "dodge") places the bars side by side for easy reading, theme_minimal()
simplifies the styling, and ggtitle() sets the chart title.
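
The three functions named above fit together as in the following sketch. It assumes ggplot2 is
installed, and the occupations, counts, and chart title are invented for illustration.

```r
library(ggplot2)

# Hypothetical records: 6 customers in each of 3 occupations.
data <- data.frame(
  Occupation = rep(c("Engineer", "Lawyer", "Teacher"), each = 6),
  Credit_Score = rep(
    c("Good", "Poor", "Poor", "Standard", "Standard", "Standard"),
    times = 3
  )
)

# geom_bar() counts rows; position = "dodge" draws one bar per class side by side.
p <- ggplot(data, aes(x = Occupation, fill = Credit_Score)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  ggtitle("Credit Score Distribution by Occupation")

print(p)
```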

Figure 2. 3: Bar charts of Occupation vs credit score distribution


The bar graph shows how Good, Poor, and Standard credit scores are distributed across
occupational groups. The x-axis lists the occupations and the y-axis shows the number of
individuals. For each occupation, three bars show the number of people with "Good", "Standard",
and "Poor" credit scores. In most occupations, "Standard" is the largest group, and "Poor"
outnumbers "Good". The counts vary by profession, but "Standard" credit scores form the most
numerous group throughout the dataset.

Findings and Interpretations

• "Standard" credit scores are the most frequent across occupations, whereas "Good" scores are
the least frequent in every profession.

• Every occupation includes individuals with "Poor" credit scores.

• The y-axis shows the number of individuals and the x-axis lists the occupations.

• The visualization suggests a possible link between occupation and creditworthiness, which
requires further testing to confirm.

Hypothesis Statement

Occupation is associated with how credit scores are distributed: across professions, most people
hold "Standard" credit scores, followed by "Poor", with the fewest in the "Good" category. This
suggests a possible relationship between occupation and financial reliability.

Statistical Test: Chi-square Test

Figure 2. 4: Chi-square Test

The Chi-squared test found a highly significant association (p < 0.001) between occupation and
credit score category.
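
A chi-squared test of independence on a contingency table can be sketched with chisq.test(). The
counts below are invented, so the statistic and p-value will not match Figure 2. 4. With real
data, the table would typically come from table(data$Occupation, data$Credit_Score).

```r
# Hypothetical contingency table: occupations (rows) by credit score class (columns).
counts <- matrix(
  c(30, 50, 70,
    45, 40, 65,
    20, 70, 60),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Occupation = c("Engineer", "Lawyer", "Teacher"),
    Credit_Score = c("Good", "Poor", "Standard")
  )
)

# Null hypothesis: occupation and credit score class are independent.
test <- chisq.test(counts)
print(test)
```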

Objective 2: Analyze financial behavior and its impact on credit classification. (Sabin Shrestha)

Analysis 2.1: Outstanding Debt and Credit Score

Question: How does outstanding debt correlate with credit scores?

Figure 2. 5: Source code for outstanding debt vs credit score

The code plots the relationship between Outstanding Debt and the Credit Score categories. The
credit score classes appear as colored bar segments. A minimal style keeps the chart readable,
and labeled axes give context. Partial transparency and distinct colors make it easier to compare
creditworthiness patterns across the debt distribution.

Figure 2. 6: Histogram of outstanding vs credit score


The stacked histogram shows how credit scores relate to outstanding debt. The horizontal axis
shows the debt amount and the vertical axis shows the number of individuals at each debt level.
Credit scores appear in three categories: Good (red), Standard (green), and Poor (blue). "Poor"
scores are concentrated among mid-range debt holders, while "Standard" accounts for many of the
individuals with minimal outstanding debt. The distribution across credit score groups becomes
more uniform as debt grows. Overall, people who carry less debt tend to have better credit
scores.

Findings and Interpretations

• When debt is low, most credit scores fall in the Standard category.

• The distribution of credit scores becomes more even as debt increases.

• Most individuals in the middle debt range have "Poor" credit scores.

Hypothesis Statement

The analysis suggests that people with lower debt tend to have better credit scores, whereas
people with higher debt are spread more evenly across all score categories, with a larger share
in the "Poor" category.

Statistical Test: ANOVA Test

Figure 2. 7: Anova Test Hypothesis


The ANOVA found no significant relationship between debt amount and credit score, because the
p-value exceeded 0.05.

Analysis 2.2: Correlation of Financial Variables.

Question: How do different financial variables interact to determine credit scores?

Figure 2. 8: Source code for visualizing Financial Variables.

The R code assesses the correlation between Age, Annual Income, Outstanding Debt, and Credit
Utilization Ratio. It first converts non-numeric values to numeric form, selects the numeric
columns, and removes rows with missing values. It then computes a correlation matrix and draws it
as a color-coded correlation plot, which shows the direction and strength of the relationship
between each pair of variables.
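
These steps can be sketched as below. The column names and values are assumptions made for
illustration, and the plotting call runs only if the corrplot package is installed. Its
method = "color" and tl.col arguments are chosen here to resemble the colored cells and red labels
described for Figure 2. 9.

```r
# Hypothetical data: Annual_Income arrives as text and one debt value is missing.
data <- data.frame(
  Age = c(23, 28, 34, 41, 55, 37),
  Annual_Income = c("19114", "34847", "52000", "61000", "87000", "45000"),
  Outstanding_Debt = c(809, 605, 1303, 1200, NA, 950),
  Credit_Utilization_Ratio = c(26.8, 31.9, 28.6, 35.1, 24.3, 30.2),
  stringsAsFactors = FALSE
)

# Convert text to numeric, keep numeric columns, and drop incomplete rows.
data$Annual_Income <- as.numeric(data$Annual_Income)
num <- na.omit(data[sapply(data, is.numeric)])

# Pearson correlation matrix of the remaining columns.
corr <- cor(num)
print(round(corr, 2))

# Draw the matrix as a colored grid when corrplot is available.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corr, method = "color", tl.col = "red")
}
```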

Figure 2. 9: Correlation of Financial Variables

The correlation plot shows the pairwise relationships among Age, Annual Income, Outstanding Debt,
and Credit Utilization Ratio. A color gradient encodes correlation strength, from deep blue for
strong positive correlations to red for negative correlations. Variable labels are drawn in red
so that they stand out. The plot shows how strongly the financial variables are associated and
helps readers understand how they interact.

Findings and Interpretations

• The correlation plot shows the relationships among Age, Annual Income, Outstanding Debt, and
Credit Utilization Ratio.

• Deep blue represents strong positive correlations, whereas shades of red represent negative
correlations.

• The plot gives insight into associations among the financial variables, which supports
decision-making.

Hypothesis Statement

The test evaluates whether Age, Annual Income, Outstanding Debt, and Credit Utilization Ratio are
positively or negatively correlated.

Correlation Analysis

Figure 2. 10: Correlation Hypothesis


The R code computes the correlation between Credit Utilization and Credit Score and checks
whether the p-value is below 0.05 to establish statistical significance. A p-value below 0.05
indicates a significant correlation, whereas a p-value above 0.05 indicates no significant
correlation.
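
The decision rule described above can be sketched with cor.test(). The report does not state how
the categorical credit score was turned into numbers or which correlation method was used, so this
sketch makes two assumptions: the classes are encoded as ordered integers (Poor = 1, Standard = 2,
Good = 3), and Spearman's rank correlation is used because that encoding is ordinal. The data are
invented.

```r
# Hypothetical records.
data <- data.frame(
  Credit_Utilization_Ratio = c(26.8, 31.9, 28.6, 35.1, 24.3, 30.2, 38.0, 27.5),
  Credit_Score = c("Good", "Standard", "Good", "Poor",
                   "Good", "Standard", "Poor", "Standard")
)

# Assumed ordinal encoding: Poor = 1, Standard = 2, Good = 3.
score_rank <- as.integer(
  factor(data$Credit_Score, levels = c("Poor", "Standard", "Good"))
)

# exact = FALSE avoids the exact-test warning caused by tied ranks.
result <- cor.test(
  data$Credit_Utilization_Ratio, score_rank,
  method = "spearman", exact = FALSE
)
print(result)

if (result$p.value < 0.05) {
  message("Significant correlation at the 0.05 level.")
} else {
  message("No significant correlation at the 0.05 level.")
}
```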

Objective 3: Investigate payment history as a predictor of credit classification. (Hemraj Budha)

Analysis 3.1: Payment Frequency and Credit Score

Question: Does the frequency of payments influence credit score classification?

Descriptive Analysis

Figure 3. 1: Descriptive Analysis for payment behavior vs credit score

The table shows the frequency counts of three payment behavior categories, summarizing general
consumer spending patterns.
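
A frequency table of this kind can be produced with table(). The column name Payment_Behaviour is
an assumption, and the five records are invented. The category labels are the ones that appear in
this report.

```r
# Hypothetical records with an assumed column name.
data <- data.frame(
  Payment_Behaviour = c(
    "Low_spent_Small_value_payments",
    "High_spent_Large_value_payments",
    "Low_spent_Small_value_payments",
    "Medium_value_payments",
    "Low_spent_Small_value_payments"
  )
)

# Count how often each payment behavior category occurs.
counts <- table(data$Payment_Behaviour)
print(sort(counts, decreasing = TRUE))
```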

Exploratory Data Analysis



Figure 3. 2: Source code for payment vs credit score

This code creates a bar chart of payment behavior against credit score category. Customers are
grouped by payment behavior, and colored bars show the credit score categories within each group.
Axis labels identify the variables, and a minimal theme keeps the chart easy to read.

Figure 3. 3: Bar charts for payment behavior vs Credit Score classification

The figure shows Payment Behavior against credit score category as a stacked bar chart. The
horizontal axis lists the spending patterns and the vertical axis shows the number of customers.
Each bar is split into three credit score segments: Poor (red), Standard (green) and Good (blue).
Most customers in the Low_spent_Small_value_payments category fall into the “Poor” and
“Standard” credit score groups. Credit scores are more evenly distributed in the
High_spent_Large_value_payments and Medium_value_payments categories. Customers who spend
little on small payments generally have lower credit scores, whereas those making large payments
tend to have better credit scores.

Findings and Interpretations.

• Customers making small-value payments most often have "Poor" or "Standard" credit scores.

• Larger-value payments are spread more evenly across the credit score categories.

• Low spending on small payments is associated with lower credit scores, whereas larger
spending is associated with better credit scores.

Hypothesis Statement

The hypothesis is that customers who spend less on small transactions tend to have worse credit
scores than customers making larger payments.

Statistical Test: Chi-Square Test.

Figure 3. 4: Chi-Square test

The code performs a Chi-Square test of the association between Payment Behavior and Credit Score
Classification. Because the p-value is less than 0.05, the null hypothesis of independence is
rejected, indicating that Payment Behavior is significantly associated with Credit Score
Classification.
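A minimal sketch of how such a test can be run in R is given below. The counts are invented for illustration and are not taken from the dataset or from Figure 3. 4, and the object names (payment_by_score, chi_result) are assumptions rather than the names used in the original script.

```r
# Illustrative contingency table: rows are payment behavior categories,
# columns are credit score classes. The counts are made up for this sketch.
payment_by_score <- as.table(matrix(
  c(300, 400, 100,
    200, 350, 150,
    150, 300, 250),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Payment_Behaviour = c("Low_spent_Small_value_payments",
                          "Medium_value_payments",
                          "High_spent_Large_value_payments"),
    Credit_Score = c("Poor", "Standard", "Good")
  )
))

# Chi-Square test of independence between the two categorical variables
chi_result <- chisq.test(payment_by_score)
print(chi_result)

# Decision rule at the 5% significance level
if (chi_result$p.value < 0.05) {
  cat("Reject the null hypothesis: the variables are associated.\n")
} else {
  cat("Fail to reject the null hypothesis: no evidence of association.\n")
}
```

With the raw data frame, the same table would typically be built with table() on the two columns before being passed to chisq.test().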

Analysis 3.3.2: Monthly Balance and Credit Classification

Question: Does monthly balance impact classification?

Descriptive Analysis

Figure 3. 5: Descriptive Analysis for the Monthly Balance.

This code displays summary statistics (minimum, first and third quartiles, median, mean and
maximum) of the "Monthly_Balance" variable.

Exploratory Data Analysis

Figure 3. 6: Source code for visualizing monthly balance and credit classification

The code creates one histogram per credit score category, each showing the frequency
distribution of monthly balance. All panels use the same bin width for straightforward
comparison, with a minimal theme and clear labels.

Figure 3. 7: histogram of monthly balance vs credit score

The figure shows faceted histograms of “Monthly Balance” for each “Credit Score” category
(Good, Poor and Standard). Each histogram gives the number of customers falling into each
monthly balance range within that credit score group. Monthly Balance is on the shared x-axis
and frequency is on the y-axis of every panel. A larger share of customers with Poor credit
scores maintain low monthly balances than customers with Good or Standard credit scores. The
figure therefore suggests a possible relationship between credit score and monthly balance.

Findings and Interpretations

• The faceted histograms show the distribution of "Monthly Balance" for the Good, Poor and
Standard credit score groups.

• Customers in the Poor credit score category have lower monthly balances than those in the
Good and Standard categories.

• The figure suggests that the distribution of monthly balance differs by credit score category.

Hypothesis Statement

The hypothesis is that customers with Poor credit scores keep lower monthly balances than those
with Good or Standard credit scores, which would indicate an association between credit score
category and balance level.

ANOVA Test

Figure 3. 8: ANOVA test of hypothesis for Monthly Balance.

An analysis of variance (ANOVA) found a statistically significant difference in mean monthly
balance between the credit score groups (p < .05).
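As a rough sketch of this kind of test, the following R code runs a one-way ANOVA on simulated data. The group means, the sample sizes and the object names (balance_data, anova_model) are assumptions chosen for illustration; they do not reproduce the values in Figure 3. 8.

```r
set.seed(1)

# Simulated monthly balances for three credit score groups (illustrative only)
balance_data <- data.frame(
  Credit_Score = factor(rep(c("Good", "Poor", "Standard"), each = 100)),
  Monthly_Balance = c(rnorm(100, mean = 450, sd = 100),
                      rnorm(100, mean = 300, sd = 100),
                      rnorm(100, mean = 400, sd = 100))
)

# One-way ANOVA: does mean Monthly_Balance differ between Credit_Score groups?
anova_model <- aov(Monthly_Balance ~ Credit_Score, data = balance_data)
anova_table <- summary(anova_model)[[1]]
print(anova_table)

# Extract the p-value for the Credit_Score term
p_value <- anova_table[["Pr(>F)"]][1]
if (p_value < 0.05) {
  cat("Mean monthly balance differs between at least two groups.\n")
} else {
  cat("No significant difference in mean monthly balance.\n")
}
```

A significant ANOVA only indicates that at least one group mean differs; a follow-up such as TukeyHSD(anova_model) would be needed to see which pairs of groups differ.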

Analysis 3.3.3: Credit Utilization Ratio and Credit Score

Question: How does the Credit Utilization Ratio affect credit score classification?

Descriptive Analysis

Figure 3. 9: Descriptive Analysis for Credit Utilization Ratio

The R code generates summary statistics (minimum, quartiles, median, mean, standard deviation
and count) of "Credit_Utilization_Ratio" for each "Credit_Score" group.
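One possible way to produce grouped statistics of this kind in base R is sketched below. The data are simulated, and the helper describe() and the object names are hypothetical; the original script in Figure 3. 9 may use a different approach, such as dplyr.

```r
set.seed(4)

# Simulated data with unequal group sizes (illustrative only)
utilization_data <- data.frame(
  Credit_Score = rep(c("Good", "Poor", "Standard"), times = c(40, 60, 100)),
  Credit_Utilization_Ratio = runif(200, min = 20, max = 50)
)

# Hypothetical helper returning the statistics named in the text
describe <- function(x) {
  c(min    = min(x),
    q1     = quantile(x, 0.25, names = FALSE),
    median = median(x),
    mean   = mean(x),
    q3     = quantile(x, 0.75, names = FALSE),
    max    = max(x),
    sd     = sd(x),
    n      = length(x))
}

# Split the ratio by credit score group and summarise each group (one row per group)
group_stats <- t(sapply(
  split(utilization_data$Credit_Utilization_Ratio, utilization_data$Credit_Score),
  describe
))
print(round(group_stats, 2))
```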

Exploratory Data Analysis

Figure 3. 10: source code for visualizing credit utilization ratio and credit scores

The R code generates a box plot comparing the credit utilization ratio across credit score
categories. The plot uses a clean layout and shows the median, quartiles and outliers of the
ratio for each credit score group.

Figure 4. 1: Box plot of credit utilization vs credit score classification.

The boxplot shows how the credit utilization ratio is distributed across the Good, Poor and
Standard credit score categories. The y-axis shows the credit utilization ratio, ranging from 20
to 50, and the x-axis shows the credit score categories. Each box summarizes the distribution of
the ratio within one credit score group. The height of the box is the interquartile range (IQR),
which contains the middle 50% of the data points, and the line inside the box marks the median.
The “whiskers” show the range of the data excluding outliers, and outliers are plotted as
individual points beyond the whiskers. The boxplot allows direct comparison of the central
tendency and spread of credit utilization ratios between credit score categories and helps
reveal patterns in how customers manage credit. Ideally, better credit score categories would
show lower utilization ratios.
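To make the box plot elements concrete, the short sketch below computes them for a small made-up vector of utilization ratios using base R's boxplot.stats(). The values are illustrative and not drawn from the dataset.

```r
# Ten illustrative credit utilization ratios, including one extreme value
ratios <- c(25, 28, 30, 31, 32, 33, 34, 36, 38, 60)

bs <- boxplot.stats(ratios)

# bs$stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker
lower_whisker <- bs$stats[1]
lower_hinge   <- bs$stats[2]
median_value  <- bs$stats[3]
upper_hinge   <- bs$stats[4]
upper_whisker <- bs$stats[5]

# Box height (hinge-based IQR) and the points drawn individually as outliers
box_height <- upper_hinge - lower_hinge
outliers   <- bs$out

cat("Box from", lower_hinge, "to", upper_hinge, "with median", median_value, "\n")
cat("Whiskers reach", lower_whisker, "and", upper_whisker, "\n")
cat("Outliers:", outliers, "\n")
```

Here the box spans 30 to 36, so values more than 1.5 times the box height (9 units) beyond the box are treated as outliers; only the value 60 falls outside that range, and the whiskers stop at 25 and 38.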

Findings and Interpretations

• The plot compares credit utilization ratios across the three credit score categories.

• The "Good" group shows a stable, narrow range of credit utilization, whereas the "Poor"
group shows a wide range and several outliers.

• The "Standard" group lies between the other two groups.

• Credit utilization is negatively related to credit score: lower ratios are associated with
better scores and higher ratios with worse scores.

Hypothesis Statement

The hypothesis is that lower credit utilization ratios are associated with better credit scores,
while higher utilization ratios have an adverse effect.

Kruskal-Wallis Test

Figure 4. 2: Kruskal-Wallis Test

A Kruskal-Wallis test showed that differences in credit utilization ratio existed between credit
score groups (p < 0.05).
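A minimal sketch of a Kruskal-Wallis test in R follows. The data are simulated so that the groups clearly differ, and the object names (utilization_data, kw_result) are assumptions rather than the names used in Figure 4. 2.

```r
set.seed(2)

# Simulated utilization ratios per credit score group (illustrative only)
utilization_data <- data.frame(
  Credit_Score = factor(rep(c("Good", "Poor", "Standard"), each = 100)),
  Credit_Utilization_Ratio = c(runif(100, 20, 35),
                               runif(100, 30, 50),
                               runif(100, 25, 42))
)

# Kruskal-Wallis rank sum test: a non-parametric alternative to one-way ANOVA
kw_result <- kruskal.test(Credit_Utilization_Ratio ~ Credit_Score,
                          data = utilization_data)
print(kw_result)

if (kw_result$p.value < 0.05) {
  cat("Utilization ratio differs between at least two credit score groups.\n")
} else {
  cat("No significant difference between credit score groups.\n")
}
```

Because the test works on ranks, it does not assume normally distributed ratios, which makes it a reasonable choice when the box plot shows skew or outliers.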

Objective 4: Relationship between Bank Accounts and Credit Card (Sujal Shrestha)

Descriptive Analysis

Figure 4. 3: Descriptive Analysis for bank account vs credit card.

The code prints summary statistics (minimum, maximum, quartiles, median and mean) for the
"Num_Bank_Accounts" and "Num_Credit_Card" columns. These statistics describe the distribution
of both variables in the dataset, including their range, central values and spread.

Exploratory Data Analysis

Figure 4. 4: Source code for Bank Accounts and Credit Card



The code uses ggplot2 in R to draw a scatter plot of the number of bank accounts against the
number of credit cards. The plot includes a title and axis labels and uses a clean theme.

Figure 4. 5: Scatter plot for Bank Accounts and Credit card.

The graph shows how the number of bank accounts relates to the number of credit cards in the
data. Each dot represents one person, positioned by number of bank accounts on the x-axis and
number of credit cards on the y-axis. The plot lets viewers identify any potential relationship
between the two variables. Based on the plotted data, most customers hold either few bank
accounts or few credit cards.

Findings and Interpretations

• The plot shows how the number of bank accounts relates to the number of credit cards held.

• Each dot represents one individual, with the number of bank accounts on the x-axis and the
number of credit cards on the y-axis.

• Almost all customers have either few bank accounts or few credit cards.

Hypothesis statement

The hypothesis is that people who maintain fewer bank accounts also hold fewer credit cards,
which would establish an association between the number of bank accounts and credit card
ownership.

Correlation analysis

Figure 4. 6: Correlation analysis for Bank Accounts and Credit Card

This R code tests the relationship between the number of bank accounts and the number of credit
cards using a Pearson correlation. The correlation coefficient of 0.00063 is effectively zero,
and the p-value of 0.8627 shows no statistically significant linear relationship between the two
variables. An if statement then prints a message reporting that the relationship is not
significant, based on the obtained p-value.
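The structure of such a test can be sketched as follows. The two columns are simulated independently of each other, so the printed values will not match the 0.00063 and 0.8627 reported above; the object names (accounts_data, cor_result) are assumptions.

```r
set.seed(3)

# Simulated, independently drawn counts (illustrative only)
accounts_data <- data.frame(
  Num_Bank_Accounts = sample(0:10, 500, replace = TRUE),
  Num_Credit_Card   = sample(0:10, 500, replace = TRUE)
)

# Pearson correlation test between the two numeric variables
cor_result <- cor.test(accounts_data$Num_Bank_Accounts,
                       accounts_data$Num_Credit_Card,
                       method = "pearson")

cat("Correlation coefficient:", round(cor_result$estimate, 5), "\n")
cat("p-value:", round(cor_result$p.value, 4), "\n")

# Report significance at the 5% level
if (cor_result$p.value < 0.05) {
  cat("The linear relationship is statistically significant.\n")
} else {
  cat("The linear relationship is not statistically significant.\n")
}
```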

Extra feature: Implementing a machine learning model to classify credit scores. (Sujal
Shrestha)

Credit score classification is a fundamental part of the financial system and underpins
decisions about individuals' creditworthiness. Surrogate models and Random Forest are popular
choices for classification tasks because of their accuracy and their ability to minimize errors.

The Random Forest model in this R code uses four attributes (Age, Annual Income, Outstanding
Debt and Credit Utilization Ratio) to predict credit scores. The code begins with set.seed(123)
so that the results are reproducible. The Credit_Score variable is converted to a factor with
as.factor() so that the model treats it as a categorical target for classification.
createDataPartition() from the caret package then splits the data, placing 70% in the training
subset (trainData) and the remainder in the test subset (testData). randomForest() from the
randomForest package fits the model with ntree = 100 decision trees, using the selected
independent variables to predict Credit_Score. After training, predict() generates predictions
for the test dataset. confusionMatrix() evaluates model performance by producing a confusion
matrix, and colnames(data) displays the column names of the data. This classification technique
can help financial institutions assess credit risk.
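The steps described above can be sketched as follows, assuming the caret and randomForest packages are installed. The data frame is simulated and the rule used to generate Credit_Score is a hypothetical stand-in for the real labels, so the resulting accuracy says nothing about the actual dataset.

```r
library(caret)
library(randomForest)

set.seed(123)  # reproducible simulation, split and forest

# Simulated predictors (illustrative only, not the assignment dataset)
n <- 600
data <- data.frame(
  Age = sample(18:65, n, replace = TRUE),
  Annual_Income = round(runif(n, 10000, 150000)),
  Outstanding_Debt = round(runif(n, 0, 5000), 2),
  Credit_Utilization_Ratio = runif(n, 20, 50)
)

# Hypothetical labelling rule: more debt and higher utilization mean more risk
risk <- data$Outstanding_Debt / 5000 +
  (data$Credit_Utilization_Ratio - 20) / 30 +
  rnorm(n, sd = 0.3)
data$Credit_Score <- as.factor(cut(risk, breaks = c(-Inf, 0.7, 1.3, Inf),
                                   labels = c("Good", "Standard", "Poor")))

# 70/30 stratified split on the target variable
trainIndex <- createDataPartition(data$Credit_Score, p = 0.7, list = FALSE)
trainData <- data[trainIndex, ]
testData  <- data[-trainIndex, ]

# Random Forest with 100 trees on the four predictors
rf_model <- randomForest(
  Credit_Score ~ Age + Annual_Income + Outstanding_Debt + Credit_Utilization_Ratio,
  data = trainData, ntree = 100
)

# Predict on the held-out data and evaluate
predictions <- predict(rf_model, newdata = testData)
cm <- confusionMatrix(predictions, testData$Credit_Score)
print(cm)
```

Splitting with createDataPartition() keeps the class proportions similar in both subsets, which matters when one credit score class is much rarer than the others.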

The Random Forest model achieves 59.37% accuracy in classifying Good, Poor and Standard credit
scores, which is better than random selection. The model is strongest at recognizing Standard
scores (72.69%) but misses many Good instances, labeling them as Standard. Specificity is
highest for Good scores at 94.08% and lowest for Standard scores at 50.51%. Identifying Good
scores remains challenging for the model, while its best Balanced Accuracy, 68.60%, is for Poor
scores. Better results would require further feature engineering and tuning, such as adjusting
the number of trees and applying oversampling.
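To show how the per-class metrics quoted above are derived, the sketch below computes accuracy, sensitivity, specificity and balanced accuracy from a confusion matrix in base R. The matrix is hypothetical and does not reproduce the model's actual confusion matrix; rows are predicted classes and columns are actual classes.

```r
classes <- c("Good", "Poor", "Standard")

# Hypothetical confusion matrix: rows = predicted, columns = actual
cm <- matrix(c(50,   5,  30,
               10, 120,  60,
               90,  75, 260),
             nrow = 3, byrow = TRUE,
             dimnames = list(Predicted = classes, Actual = classes))

# One-vs-rest counts for each class
tp <- setNames(diag(cm), classes)   # correctly predicted as the class
fp <- rowSums(cm) - tp              # predicted as the class, actually another
fn <- colSums(cm) - tp              # actually the class, predicted as another
tn <- sum(cm) - tp - fp - fn        # everything else

accuracy          <- sum(tp) / sum(cm)
sensitivity       <- tp / (tp + fn)  # share of actual class members found
specificity       <- tn / (tn + fp)  # share of non-members correctly excluded
balanced_accuracy <- (sensitivity + specificity) / 2

cat("Overall accuracy:", round(accuracy, 4), "\n")
print(round(data.frame(sensitivity, specificity, balanced_accuracy), 4))
```

In this made-up matrix the Good class has low sensitivity but high specificity, because many actual Good cases are predicted as Standard while few other cases are predicted as Good. This is the same kind of pattern the text describes for the real model.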

The dataset includes 28 fields containing identification data and financial records, with Credit
Score as the target variable taking the values Good, Poor or Standard. The financial indicators
Annual Income, Monthly Salary, Num Bank Accounts and Num Credit Cards, together with
loan-related metrics, measure creditworthiness. Additional variables (Num of Delayed Payment,
Outstanding Debt, Credit Utilization Ratio and Payment Behavior) capture financial stability and
payment consistency. The structured dataset is intended for credit risk assessment using Random
Forest machine learning methods to determine loan risk and credit scores.

4. Conclusion
This study assessed the variables that affect credit scoring and risk evaluation in financial
institutions. Data analysis in R revealed important relationships linking demographic
attributes, financial behavior and payment history to credit classification. The results show
strong correlations between credit scores and borrowers' credit utilization, debt obligations
and payment reliability. Age, income level and occupational background also contribute to the
evaluation of creditworthiness.

4.1 Recommendations
This research reviewed the elements that influence credit scoring and risk assessment in
financial institutions. Data analysis in R established the essential relationships between
credit classification and demographic markers, payment history and financial behavior.
Borrowers' credit usage, debt obligations and payment records determine the credit scores that
financial institutions assign. Creditworthiness evaluation also considers three main variables:
age, income level and occupational background.

4.2 Limitations and Future Directions


The analysis focused on financial variables, while real-world influences on credit scores, such
as behavioral components and economic conditions, were not available. Future risk prediction
models should incorporate data on social activity and transaction patterns. Improving the
accuracy of credit score prediction will also require tuning feature selection and testing
alternative machine learning algorithms.

4.3 Word Count:


Total words: 5635


6. Appendix
6.1 Workload Matrix
Student          Allocation of Work                                                        Signatures
Sujal Shrestha   Relationship between Bank Accounts and Credit Card, Implementing the
                 machine learning model, Introduction
Hemraj Budha     Investigate payment history as a predictor of credit classification,
                 Data Preparation, Coding
Nischal Sharma   Relationships between demographic factors and credit scores, Helping
                 for Documentation
Sabin Shrestha   Analyze financial behavior and its impact on credit classification,
                 Conclusion
