
IU2241230442 Data Science – CE0630 PROJECT 6-CSE-C2

DATA SCIENCE PROJECT

AIM: Applying statistics (DESCRIPTIVE & INFERENTIAL) to a BANK DATASET with 2000 rows and 9 columns (Account_ID, Customer_Name, Account_Type, Balance, Branch, Status, Created_At, Loan_Amount, Interest_Rate), using various statistical formulas.

Here is a preview of the first 100 rows of the dataset:


CSV file: bank_system_data.csv

DESCRIPTIVE STATISTICS:
Here I have applied descriptive statistics (mean, median, mode, IQR, variance, standard deviation, and range) to the first 100 rows of the CSV file uploaded above.

CODE:

import pandas as pd
import numpy as np

# Load the dataset (from Google Drive in Colab), with basic error handling
path = "/content/drive/MyDrive/bank_system_data.csv"
try:
    df = pd.read_csv(path)
    display(df.head())
    print(df.shape)
except FileNotFoundError:
    print("Error: 'bank_system_data.csv' not found.")
    df = None
except pd.errors.ParserError:
    print("Error: Could not parse the CSV file.")
    df = None
except Exception as e:
    print(f"An unexpected error occurred: {e}")
    df = None

# Calculate descriptive statistics for the top 100 data points
top_100_df = df.head(100)

# Numerical features


numerical_features = ['Balance', 'Loan_Amount', 'Interest_Rate']

# Descriptive statistics
for feature in numerical_features:
    print(f"Descriptive statistics for {feature}:")
    print(f"Mean: {np.mean(top_100_df[feature])}")
    print(f"Median: {np.median(top_100_df[feature])}")
    print(f"Mode: {top_100_df[feature].mode()[0]}")
    print(f"IQR: {np.percentile(top_100_df[feature], 75) - np.percentile(top_100_df[feature], 25)}")
    print(f"Variance: {np.var(top_100_df[feature])}")
    print(f"Standard Deviation: {np.std(top_100_df[feature])}")
    print(f"Range: {np.max(top_100_df[feature]) - np.min(top_100_df[feature])}")
    print("-" * 20)

OUTPUT:

Descriptive statistics for Balance:
Mean: 53137.34709999999
Median: 57411.595
Mode: 109.4
IQR: 40903.7225
Variance: 719679162.2630583
Standard Deviation: 26826.836605590648
Range: 99366.68000000001
--------------------
Descriptive statistics for Loan_Amount:
Mean: 29046.043400000002
Median: 32963.365
Mode: 35.68
IQR: 21746.47
Variance: 209940534.04584235
Standard Deviation: 14489.324830572416
Range: 49421.15
--------------------
Descriptive statistics for Interest_Rate:
Mean: 7.2698
Median: 7.655
Mode: 3.02
IQR: 4.420000000000002
Variance: 6.31287996
Standard Deviation: 2.5125445190085687
Range: 8.81
--------------------
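As a quick cross-check (an addition, not part of the original submission), the same summary can be reproduced with pandas' built-in aggregations. One caveat worth noting: np.var and np.std above use the population formulas (ddof=0), while pandas defaults to the sample versions (ddof=1), so the two sets of numbers will differ slightly.

# Hypothetical cross-check of the NumPy-based summary above,
# assuming top_100_df is the first-100-rows frame built earlier.
summary = top_100_df[['Balance', 'Loan_Amount', 'Interest_Rate']].describe()
print(summary)

for feature in ['Balance', 'Loan_Amount', 'Interest_Rate']:
    col = top_100_df[feature]
    iqr = col.quantile(0.75) - col.quantile(0.25)
    # pandas var() uses ddof=1 by default (sample); pass ddof=0 for the population version
    print(f"{feature}: IQR={iqr:.2f}, sample var={col.var():.2f}, population var={col.var(ddof=0):.2f}")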


SKEWNESS AND KURTOSIS:
Here I have computed the skewness and kurtosis of each numerical feature and plotted its distribution as a histogram.

CODE:
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your DataFrame and 'Balance', 'Loan_Amount', 'Interest_Rate' are your numerical features
numerical_features = ['Balance', 'Loan_Amount', 'Interest_Rate']

def analyze_skewness_kurtosis(data, feature):
    """
    Calculates and prints skewness and kurtosis, and displays a histogram.

    Args:
        data: The pandas DataFrame containing the data.
        feature: The name of the numerical feature to analyze.
    """
    skewness = stats.skew(data[feature])
    kurtosis = stats.kurtosis(data[feature])

    print(f"Analysis for {feature}:")
    print(f"Skewness: {skewness:.2f}")
    print(f"Kurtosis: {kurtosis:.2f}")

    # Interpretation of skewness
    if skewness > 0:
        print("Distribution is right-skewed (positive skew).")
    elif skewness < 0:
        print("Distribution is left-skewed (negative skew).")
    else:
        print("Distribution is approximately symmetrical.")

    # Interpretation of kurtosis
    if kurtosis > 0:
        print("Distribution is leptokurtic (heavy-tailed).")
    elif kurtosis < 0:
        print("Distribution is platykurtic (light-tailed).")
    else:
        print("Distribution is mesokurtic (normal-tailed).")

    # Display histogram
    plt.figure(figsize=(8, 6))
    sns.histplot(data[feature], kde=True)


    plt.title(f"Distribution of {feature}")
    plt.xlabel(feature)
    plt.ylabel("Frequency")
    plt.show()
    print("-" * 30)

# Apply the analysis to each numerical feature
for feature in numerical_features:
    analyze_skewness_kurtosis(df, feature)

OUTPUT:
Analysis for Balance:
Skewness: -0.02
Kurtosis: -1.16
Distribution is left-skewed (negative skew).
Distribution is platykurtic (light-tailed).

------------------------------
Analysis for Loan_Amount:
Skewness: -0.08
Kurtosis: -1.21
Distribution is left-skewed (negative skew).
Distribution is platykurtic (light-tailed).


Analysis for Interest_Rate:
Skewness: 0.00
Kurtosis: -1.24
Distribution is right-skewed (positive skew).
Distribution is platykurtic (light-tailed).
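As a sanity check (my addition, not from the original code), pandas exposes the same statistics directly. Note that pandas' skew() and kurt() apply bias-corrected sample formulas, while scipy.stats.skew and scipy.stats.kurtosis default to the biased estimators, so the values can differ slightly; kurt() also reports excess kurtosis, matching the interpretation used above.

# Hypothetical cross-check using pandas' built-in (bias-corrected) estimators
for feature in ['Balance', 'Loan_Amount', 'Interest_Rate']:
    print(f"{feature}: skewness={df[feature].skew():.2f}, excess kurtosis={df[feature].kurt():.2f}")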


INFERENTIAL STATISTICS:
Here I have applied several inferential statistical tests (one-sample and two-sample t-tests, ANOVA, and a chi-square test) and fitted a linear regression, which is also represented graphically.

CODE:

import pandas as pd
from scipy import stats
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import numpy as np

# Load your dataset (or reuse existing DataFrame)


# df = pd.read_csv('your_file.csv')
df_100 = df.head(100) # First 100 rows

# 1. One-Sample T-Test: Is avg loan amount different from $25,000?


t_stat, p_val = stats.ttest_1samp(df_100['Loan_Amount'], 25000)
print("One-sample t-test (Loan_Amount vs $25,000):")
print(f"t-statistic = {t_stat:.3f}, p-value = {p_val:.3f}\n")

# 2. Two-Sample T-Test: Comparing average balances between two account types


savings = df_100[df_100['Account_Type'] == 'Savings']['Balance']
checking = df_100[df_100['Account_Type'] == 'Checking']['Balance']
t2_stat, p2_val = stats.ttest_ind(savings, checking, equal_var=False)
print("Two-sample t-test (Savings vs Checking - Balance):")
print(f"t-statistic = {t2_stat:.3f}, p-value = {p2_val:.3f}\n")

# 3. ANOVA: Compare loan amounts across account types


anova_groups = [group["Loan_Amount"].values for name, group in df_100.groupby("Account_Type")]
f_stat, anova_p_val = stats.f_oneway(*anova_groups)
print("ANOVA (Loan_Amount ~ Account_Type):")
print(f"F-statistic = {f_stat:.3f}, p-value = {anova_p_val:.3f}\n")

# 4. Chi-Square Test: Account_Type vs Status


contingency_table = pd.crosstab(df_100['Account_Type'], df_100['Status'])
chi2, chi_p, dof, expected = stats.chi2_contingency(contingency_table)
print("Chi-square test (Account_Type vs Status):")


print(f"Chi2 = {chi2:.3f}, p-value = {chi_p:.3f}, dof = {dof}\n")

# 5. Linear Regression: Predict Loan Amount from Balance


X = df_100[['Balance']]
y = df_100['Loan_Amount']
model = LinearRegression().fit(X, y)
print("Linear Regression (Loan_Amount ~ Balance):")
print(f"Intercept: {model.intercept_:.2f}, Slope: {model.coef_[0]:.4f}")
r2 = model.score(X, y)
print(f"R-squared: {r2:.3f}")

# Optional: Plot regression


plt.figure(figsize=(8, 5))
sns.regplot(x='Balance', y='Loan_Amount', data=df_100)
plt.title('Linear Regression: Loan Amount vs Balance')
plt.tight_layout()
plt.show()

OUTPUT:

One-sample t-test (Loan_Amount vs $25,000):
t-statistic = 2.778, p-value = 0.007

Two-sample t-test (Savings vs Checking - Balance):
t-statistic = -1.192, p-value = 0.237

ANOVA (Loan_Amount ~ Account_Type):
F-statistic = 0.325, p-value = 0.723

Chi-square test (Account_Type vs Status):
Chi2 = 1.924, p-value = 0.750, dof = 4

Linear Regression (Loan_Amount ~ Balance):
Intercept: 33819.62, Slope: -0.0898
R-squared: 0.028
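statsmodels is imported in the code above but never used. As an optional extension (my addition, assuming the same df_100 frame), fitting the regression with statsmodels OLS also reports the slope's standard error, p-value, and confidence interval, which sklearn's LinearRegression does not expose.

# Optional sketch: the same regression via statsmodels, to get inference on the slope
import statsmodels.api as sm   # already imported in the code above

X_sm = sm.add_constant(df_100['Balance'])    # add the intercept column
ols_model = sm.OLS(df_100['Loan_Amount'], X_sm).fit()
print(ols_model.summary())                   # coefficients, std errors, p-values, R-squared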


Machine Learning Algorithm:

Logistic Regression:
Here I have trained a logistic regression model on the first 100 rows to predict account Status (Active = 1, Inactive = 0) from Balance, Loan_Amount, and Interest_Rate.

CODE:

# STEP 1: Import necessary libraries


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# STEP 2: Load your dataset


# If already loaded, skip this. If not, upload and load:
# from google.colab import files
# uploaded = files.upload()
# df = pd.read_csv("your_file.csv")

# STEP 3: Use first 100 rows


df_100 = df.head(100).copy()

# STEP 4: Check column names to avoid typos


print(df_100.columns)

# STEP 5: Encode target variable


df_100['Status'] = df_100['Status'].map({'Active': 1, 'Inactive': 0})


# STEP 6: Handle missing values (if any)


df_100 = df_100.dropna(subset=['Balance', 'Loan_Amount', 'Interest_Rate', 'Status'])

# STEP 7: Define features and target


X = df_100[['Balance', 'Loan_Amount', 'Interest_Rate']]
y = df_100['Status']

# STEP 8: Train-test split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# STEP 9: Train logistic regression model


model = LogisticRegression()
model.fit(X_train, y_train)

# STEP 10: Predict and evaluate


y_pred = model.predict(X_test)

print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nAccuracy Score:", accuracy_score(y_test, y_pred))

OUTPUT:

Faculty Name: Dr. Sheetal Pandya


Signature:___________________

Prepared by: Harsh Bhavsar
