Assignment 3 & 4: AI Project
COMPUTER SCIENCE DEPARTMENT
Artificial Intelligence
Assignment # 03 & 04
Total Marks: 4
Obtained Marks:
Employee Attrition Prediction using Machine Learning
1. Project Overview
The purpose of this project is to predict employee attrition (turnover) in a hospital organization based on various
factors such as demographic characteristics, job-related attributes, and performance metrics. By predicting
employee attrition, hospitals can take proactive steps to retain valuable employees and reduce turnover-related
costs.
This document covers the following aspects:
- Data Collection and Preprocessing
- Exploratory Data Analysis (EDA)
- Statistical Analysis
- Machine Learning Model Implementation
- Model Evaluation
- Results Interpretation and Conclusion
2. Problem Statement
Employee attrition is a common issue faced by organizations across industries. High turnover rates can lead to
increased recruitment costs, loss of experienced employees, and decreased overall productivity. Understanding the
factors that contribute to employee attrition can help management take preventive actions.
Our objective is to use machine learning techniques to predict which employees are likely to leave the
organization, based on their personal characteristics, job-related factors, and performance metrics.
3. Data Collection
The dataset used for this project contains information on hospital employees, including demographic and job-related features. It includes attributes such as:
- Age
- Gender
- Department
- Monthly Salary
- Distance from Home
- Education
- Job Role
- Marital Status
Assuming the data is in a CSV file or similar format, we will import necessary Python libraries and load
the data:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset (assuming it is stored as a CSV file)
df = pd.read_csv('hospital_employee_data.csv')
Ensure that the data is cleaned and processed for analysis. For example:
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})
4. Visualizations
A. Gender vs Attrition
A simple bar plot can show the turnover rate for each gender.
sns.countplot(data=df, x='Gender', hue='Attrition')
plt.title('Attrition Rate by Gender')
plt.show()
B. Age vs Attrition
A box plot can show how attrition rates vary across different age groups.
sns.boxplot(data=df, x='Attrition', y='Age')
plt.title('Attrition Rate by Age')
plt.show()
C. Department vs Attrition
A bar plot can show how different departments contribute to the turnover rate.
sns.countplot(data=df, x='Department', hue='Attrition')
plt.title('Attrition Rate by Department')
plt.show()
D. Distance from Home vs Attrition
A box plot can show how attrition varies with distance from home.
sns.boxplot(data=df, x='Attrition', y='DistanceFromHome')
plt.title('Attrition Rate by Distance From Home')
plt.show()
E. Education vs Attrition
A bar plot can help visualize how turnover rates change based on education level.
sns.countplot(data=df, x='Education', hue='Attrition')
plt.title('Attrition Rate by Education Level')
plt.show()
F. Environment/Culture vs Attrition
A bar plot can show the relationship between workplace culture/environment and attrition.
sns.countplot(data=df, x='EnvironmentCulture', hue='Attrition')
plt.title('Attrition Rate by Work Environment/Culture')
plt.show()
G. Marital Status vs Attrition
A bar plot can show how turnover rates differ by marital status.
sns.countplot(data=df, x='MaritalStatus', hue='Attrition')
plt.title('Attrition Rate by Marital Status')
plt.show()
H. Performance Rating vs Attrition
A box plot can show how attrition relates to performance ratings.
sns.boxplot(data=df, x='Attrition', y='PerformanceRating')
plt.title('Attrition Rate by Performance Rating')
plt.show()
I. Correlation Heatmap
A heatmap can show the correlations among the numerical features.
corr = df[['Age', 'DailyRate', 'DistanceFromHome', 'MonthlySalary', 'WorkExperience', 'PerformanceRating']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Numerical Features')
plt.show()
After generating the above visualizations, you can analyze the trends and relationships between various
factors (such as age, gender, salary, department) and employee attrition. For example, if employees with
lower performance ratings or lower salaries have higher attrition, that could be a signal to address these
areas.
Final Note:
The above visualizations can be customized based on the specific structure of the dataset and the needs of the analysis.
5. Statistical Analysis
To perform ANOVA (Analysis of Variance) and Chi-square tests for analyzing the relationship between various factors (such as gender, age, and department) and employee turnover, we need to understand how each factor relates to turnover, whether it is categorical or continuous, and then apply the appropriate test.
The following subsections show how to perform both ANOVA and Chi-square tests on the dataset.
1. ANOVA Testing
ANOVA is typically used to compare the means of continuous variables across multiple groups (such as
departments, gender, etc.) to determine if there is a significant difference in turnover rates.
Steps:
1. Check for continuous vs categorical variables: ANOVA is useful when you're comparing a
continuous variable (such as age, salary, or distance from home) with a categorical factor
(such as department or gender).
2. Perform the ANOVA test: For each continuous variable against the turnover (binary), we can
perform ANOVA.
from scipy import stats

# Run a one-way ANOVA for each continuous variable, comparing the two attrition groups
# (the variables listed here are from the dataset description; adjust to the actual columns)
continuous_vars = ['Age', 'MonthlySalary', 'DistanceFromHome']
for var in continuous_vars:
    group1 = df[df['Attrition'] == 1][var]  # employees who left
    group2 = df[df['Attrition'] == 0][var]  # employees who stayed
    f_stat, p_value = stats.f_oneway(group1, group2)
    print(f"ANOVA for {var} vs Attrition:")
    print(f"F-statistic: {f_stat:.4f}, p-value: {p_value:.4f}")
    # Interpretation
    if p_value < 0.05:
        print(f"The difference in {var} between attrition groups is statistically significant.\n")
    else:
        print(f"The difference in {var} between attrition groups is not statistically significant.\n")
Explanation:
f_oneway compares the mean of each continuous variable across the two attrition groups; a p-value below 0.05 indicates that the group means differ significantly.
2. Chi-Square Test
The Chi-square test is used to test the association between two categorical variables. In our case, we can
check whether categorical variables like Gender, Business Travel, Department, Marital Status,
Education, Job Role, etc. are significantly related to Attrition (whether an employee left or stayed).
Steps:
1. Select categorical variables: We’ll perform Chi-square tests on factors like Gender,
BusinessTravel, Department, MaritalStatus, Education, JobRole, etc.
2. Create contingency tables: The Chi-square test works on contingency tables where the rows
represent categories of one variable (e.g., Gender) and the columns represent categories of
the other variable (Attrition).
3. Perform the Chi-square test: We use scipy.stats.chi2_contingency() to perform the test.
# Import the Chi-Square test function
from scipy.stats import chi2_contingency

# Run a Chi-square test for each categorical variable against Attrition
categorical_vars = ['Gender', 'Department', 'MaritalStatus', 'Education', 'JobRole']
for var in categorical_vars:
    contingency_table = pd.crosstab(df[var], df['Attrition'])
    chi2_stat, p_val, dof, expected = chi2_contingency(contingency_table)
    print(f"Chi-Square for {var} vs Attrition:")
    print(f"Chi-Square statistic: {chi2_stat:.4f}, p-value: {p_val:.4f}")
    # Interpretation
    if p_val < 0.05:
        print(f"The association between {var} and attrition is statistically significant.\n")
    else:
        print(f"The association between {var} and attrition is not statistically significant.\n")
Explanation:
● Contingency Table: A contingency table for each categorical variable shows the frequency of
occurrences for each combination of factor and attrition (e.g., Gender vs Attrition).
● The p-value will indicate whether there is a significant association between the
categorical variable and turnover (Attrition).
● If the p-value is less than 0.05, it suggests a statistically significant relationship between
the factor and employee turnover.
3. Summary of Analysis
● ANOVA will give us insight into how continuous variables (like age, salary, or work
experience) differ between employees who left and those who stayed.
● Chi-square tests will tell us whether categorical variables (such as gender, department, or
job role) have a significant association with employee turnover.
After running the above tests, the output might look something like:
ANOVA for Age vs Attrition:
F-statistic: 2.6587, p-value: 0.0105
The difference in Age between attrition groups is statistically significant.
Chi-Square for Gender vs Attrition:
Chi-Square statistic: 4.1253, p-value: 0.0423
The association between Gender and attrition is statistically significant.
Conclusion:
● If the p-value is less than 0.05, we can conclude that there is a statistically significant
relationship between that factor (whether categorical or continuous) and the employee
attrition rate.
6. Machine Learning Model Implementation
Logistic Regression
Logistic Regression is a commonly used machine learning model for binary classification tasks, such as predicting whether an employee will leave or stay.
1. Data Preprocessing
Before applying the Logistic Regression model, we need to preprocess the data. This includes handling
missing values, encoding categorical variables, and scaling numerical features.
● Handle missing values: Check for missing values and handle them appropriately (either by
filling them or removing rows/columns).
● Encode categorical variables: Convert categorical variables (like Gender, Department, etc.)
to numerical representations using techniques like one-hot encoding or label encoding.
● Feature scaling: Logistic Regression performs better when numerical features are scaled
to similar ranges. We can use StandardScaler or MinMaxScaler for this purpose.
● Split the data: Split the dataset into a training set and a testing set (typically 80% training,
20% testing).
● Features: These are the variables that we use to predict attrition (e.g., Age, Gender,
MonthlySalary, etc.).
● Target variable: The variable to predict, in this case, Attrition (1 for leaving, 0 for staying).
Evaluate the model's performance using metrics like accuracy, precision, recall, F1 score, and the
confusion matrix. We can also plot the ROC curve to evaluate the model’s ability to distinguish between
the two classes.
Step-by-Step Code:
1. Import Libraries
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
2. Data Preprocessing
● Load the dataset: Assuming you have the dataset loaded in df.
# Load the dataset
df = pd.read_csv('hospital_employee_data.csv')
● Check for missing values:
# Check for missing values
df.isnull().sum()
● Handle missing values: You can fill missing values with mean, median, or use more
advanced imputation techniques based on the type of data.
# Example: Fill missing values with the median (for numerical columns)
df.fillna(df.median(numeric_only=True), inplace=True)
● Encode categorical variables: Use one-hot encoding or label encoding for categorical
variables like Gender, BusinessTravel, Department, etc.
# Example: One-hot encode categorical variables
df = pd.get_dummies(df, drop_first=True)
● Separate features and target variable: The target variable is Attrition, and the remaining columns are the features.
# Separate features (X) and target (y)
X = df.drop('Attrition', axis=1)
y = df['Attrition']
● Scale the features and split the data:
# Scale the features so they are on comparable ranges
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Initialize and train the Logistic Regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Make predictions on the test set
y_pred = logreg.predict(X_test)

# Accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
● Confusion Matrix: A confusion matrix shows the number of correct and incorrect predictions.
# Confusion matrix and classification report
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_report(y_test, y_pred))
● ROC Curve and AUC: The ROC curve is useful for evaluating the model's performance in
distinguishing between the two classes (0 and 1). AUC (Area Under the Curve) provides a single
number to evaluate model performance.
# Calculate the AUC score
y_prob = logreg.predict_proba(X_test)[:, 1]
auc_score = roc_auc_score(y_test, y_prob)
print(f"AUC Score: {auc_score:.4f}")
Example Output:
Accuracy: 0.85
Confusion Matrix:
[[312  23]
 [ 41  54]]
Classification Report:
              precision    recall  f1-score   support

    accuracy                           0.85       430
   macro avg       0.79      0.75      0.76       430
weighted avg       0.84      0.85      0.84       430
● Accuracy: The model predicts correctly 85% of the time, which is a good indicator of
performance.
● Confusion Matrix:
o True Positives (TP): 54 employees who left and were correctly predicted to leave.
o True Negatives (TN): 312 employees who stayed and were correctly predicted to stay.
o False Positives (FP): 23 employees who stayed but were predicted to leave.
o False Negatives (FN): 41 employees who left but were predicted to stay.
● AUC Score: The AUC of 0.88 suggests that the model is fairly good at distinguishing between
employees who stay and those who leave.
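As a quick consistency check on these numbers, the accuracy can be recomputed directly from the confusion matrix: (TN + TP) / Total = (312 + 54) / 430 = 366 / 430 ≈ 0.85, which matches the reported accuracy.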
Random Forest
Random Forest is an ensemble machine learning algorithm that combines multiple decision trees to
create a stronger and more accurate model. It works well for both classification and regression tasks, and
in this case, we will use it to predict employee attrition (whether an employee will leave or stay).
Random Forest can handle both categorical and continuous data, and it is particularly good at
managing non-linear relationships and large datasets.
1. Data Preprocessing:
o Handle missing values.
o Encode categorical variables.
o Scale features (optional but recommended for better performance).
o Split the data into training and testing sets.
2. Train the Random Forest Model:
o We will use the RandomForestClassifier from sklearn to build the model.
Code Implementation:
1. Import Libraries and Prepare the Data
# Import the Random Forest classifier
from sklearn.ensemble import RandomForestClassifier

# Load the dataset (already preprocessed as in the previous section: missing values handled,
# categorical variables encoded, and features X / target y separated)
df = pd.read_csv('hospital_employee_data.csv')

# Scale the features using StandardScaler (optional, especially for algorithms like SVM or Logistic Regression)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
2. Train the Random Forest Model
# Initialize and train the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
Here, we use 100 trees (n_estimators=100), but you can adjust this based on the size of your data and performance considerations.
# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=["Stayed", "Left"], yticklabels=["Stayed", "Left"])
plt.title('Confusion Matrix')
plt.show()

# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred))
Accuracy is computed as:
Accuracy = (True Positives + True Negatives) / Total Predictions
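For reference, the other metrics shown in the classification report are defined analogously:
Precision = True Positives / (True Positives + False Positives)
Recall = True Positives / (True Positives + False Negatives)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)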
Example Output:
Accuracy: 0.88
Confusion Matrix:
[[310  25]
 [ 32  63]]
Classification Report:
              precision    recall  f1-score   support

    accuracy                           0.88       430
   macro avg       0.82      0.80      0.80       430
weighted avg       0.87      0.88      0.87       430
● Accuracy (88%): The Random Forest model predicts the employee attrition correctly 88%
of the time, which is quite good.
● Confusion Matrix:
o True Positives (63): 63 employees who left were correctly predicted.
o True Negatives (310): 310 employees who stayed were correctly predicted.
o False Positives (25): 25 employees who stayed were incorrectly predicted to leave.
o False Negatives (32): 32 employees who left were incorrectly predicted to stay.
● Classification Report:
o Precision, recall, and F1-score values indicate that the model has a reasonably good
performance in predicting both classes (staying and leaving).
● AUC Score (0.91): The model has a high AUC score, meaning it is good at distinguishing
between employees who will stay and those who will leave.
● ROC Curve: The curve above the diagonal line shows that the model has good discriminative
ability.
Conclusion:
● The Random Forest model provides a strong performance for predicting employee attrition.
● Key metrics like accuracy, classification report, and AUC indicate that the model can
reliably predict whether an employee will leave or stay.
● Tuning: If you want to improve the model further, you can tune the hyperparameters (e.g., n_estimators, max_depth, min_samples_split, etc.) using techniques like GridSearchCV or RandomizedSearchCV to find the best set of parameters, as sketched below.
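A minimal GridSearchCV sketch of that tuning step; the parameter grid values below are illustrative assumptions, not values used in this project:
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameter values (illustrative assumptions)
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
}

# 5-fold cross-validated grid search, scored by ROC AUC
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best cross-validated AUC:", grid_search.best_score_)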
This concludes the Random Forest model implementation for employee attrition prediction.
Support Vector Machine (SVM) is a powerful classification algorithm, particularly effective in high-dimensional spaces. SVM finds a hyperplane that best separates the classes, and it can handle both linear and non-linear classification problems using the kernel trick (e.g., linear, polynomial, or radial basis function (RBF) kernels).
1. Data Preprocessing:
o Handle missing values.
o Encode categorical variables.
o Scale features (SVM performs better with scaled data).
o Split the dataset into training and testing sets.
2. Train the Support Vector Machine Model:
o Use SVC from sklearn to train the model with the appropriate kernel.
3. Evaluate the Model:
o Evaluate the model's performance using accuracy, confusion matrix, classification
report, and ROC AUC score.
Code Implementation:
1. Import Libraries
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset (preprocessed, encoded, and scaled into X_scaled and y as in the previous sections)
df = pd.read_csv('hospital_employee_data.csv')

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Initialize and train the Support Vector Machine model (using the Radial Basis Function kernel)
svm_classifier = SVC(kernel='rbf', random_state=42)
svm_classifier.fit(X_train, y_train)
● Kernel Choice: The Radial Basis Function (RBF) kernel is commonly used because it maps the data into a higher-dimensional space, allowing the model to capture non-linear patterns. You can also experiment with other kernels such as linear or polynomial, depending on the dataset, as illustrated below.
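A quick sketch of fitting alternative kernels for comparison, assuming the same train/test split as above (the RBF model remains the one evaluated in this report):
# Alternative kernels for comparison (illustrative only)
svm_linear = SVC(kernel='linear', random_state=42)
svm_poly = SVC(kernel='poly', degree=3, random_state=42)
svm_linear.fit(X_train, y_train)
svm_poly.fit(X_train, y_train)
print("Linear kernel accuracy:", svm_linear.score(X_test, y_test))
print("Polynomial kernel accuracy:", svm_poly.score(X_test, y_test))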
# Make predictions on the test set
y_pred = svm_classifier.predict(X_test)

# Accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=["Stayed", "Left"], yticklabels=["Stayed", "Left"])
plt.title('Confusion Matrix')
plt.show()

# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred))
1. Accuracy:
o The proportion of correct predictions made by the model. It is a simple and widely
used metric to evaluate the model's performance.
2. Confusion Matrix:
o A table used to describe the performance of a classification model by comparing the
predicted and actual values:
▪ True Positives (TP): The model predicted attrition (1) and the employee actually
left.
▪ True Negatives (TN): The model predicted no attrition (0) and the employee
stayed.
▪ False Positives (FP): The model predicted attrition (1), but the employee actually stayed.
▪ False Negatives (FN): The model predicted no attrition (0), but the employee actually left.
Example Output:
Accuracy: 0.86
Confusion Matrix:
[[320  15]
 [ 44  51]]
Classification Report:
              precision    recall  f1-score   support

    accuracy                           0.86       430
   macro avg       0.83      0.75      0.77       430
weighted avg       0.85      0.86      0.85       430
● Accuracy (86%): The model correctly predicts employee attrition (whether they leave or stay)
86% of the time.
● Confusion Matrix:
o True Positives (51): 51 employees who left were correctly predicted to leave.
o True Negatives (320): 320 employees who stayed were correctly predicted to stay.
o False Positives (15): 15 employees who stayed were incorrectly predicted to leave.
o False Negatives (44): 44 employees who left were incorrectly predicted to stay.
Conclusion:
● The SVM model has shown strong performance in predicting employee attrition with a good
balance of precision and recall.
● The AUC score of 0.89 suggests that the model is effective at distinguishing between
employees who stay and those who leave.
● To improve the model, you can fine-tune the hyperparameters of the SVM model (e.g., kernel, C, gamma) using GridSearchCV or RandomizedSearchCV to optimize its performance, as sketched below.
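A minimal RandomizedSearchCV sketch of that tuning step; the parameter ranges are illustrative assumptions, not values from this project:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

# Candidate hyperparameter distributions (illustrative assumptions)
param_distributions = {
    'C': loguniform(1e-2, 1e2),
    'gamma': loguniform(1e-4, 1e0),
    'kernel': ['rbf', 'linear'],
}

# Randomized search over 20 sampled configurations, scored by ROC AUC
search = RandomizedSearchCV(SVC(random_state=42), param_distributions, n_iter=20, cv=5, scoring='roc_auc', random_state=42, n_jobs=-1)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best cross-validated AUC:", search.best_score_)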
Here, we will generate the ROC curve for the Support Vector Machine (SVM) model, which will help
us visualize the classifier's performance.
1. Calculate True Positive Rate (TPR): Also called sensitivity or recall. It is the proportion of
actual positives (attrition = 1) that are correctly identified by the model.
2. Calculate False Positive Rate (FPR): It is the proportion of actual negatives (non-attrition = 0)
that are incorrectly identified as positives by the model.
# Import necessary libraries for plotting
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
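The explanation below describes computing the ROC curve from the SVM's decision function; a minimal sketch of those steps, assuming the svm_classifier and test split from the previous section:
# Decision function scores: the distance of each test sample from the decision boundary
y_scores = svm_classifier.decision_function(X_test)

# Normalize the scores to lie between 0 and 1 for the ROC curve plot
y_scores = (y_scores - y_scores.min()) / (y_scores.max() - y_scores.min())

# Compute FPR, TPR, and the AUC score
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
auc_score = roc_auc_score(y_test, y_scores)
print(f"AUC Score: {auc_score:.4f}")

# Plot the ROC curve (blue) against the random-classifier diagonal (red dashed line)
plt.plot(fpr, tpr, color='blue', label=f'SVM (AUC = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], color='red', linestyle='--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for SVM Model')
plt.legend()
plt.show()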
Explanation:
● decision_function(X_test): This method gives the decision function score, which is the
distance of each sample from the decision boundary. We use this to calculate the predicted
probabilities.
● Normalization: We normalize the decision function output to lie between 0 and 1, which
is necessary for the ROC curve plot.
● roc_curve(): This function calculates the FPR, TPR, and thresholds at various points.
● roc_auc_score(): This function calculates the AUC score, which summarizes the overall
performance of the classifier.
Example Output:
● True Positive Rate (TPR): The y-axis of the ROC curve represents the proportion of true
positives, i.e., the fraction of actual attrition events that were correctly predicted by the
model.
● False Positive Rate (FPR): The x-axis of the ROC curve represents the proportion of false
positives, i.e., the fraction of employees who stayed but were incorrectly predicted to
leave.
● Diagonal line: The red dashed line represents a random classifier (no better than chance).
A good model should perform better than the diagonal line.
● Area Under the Curve (AUC): The larger the AUC, the better the model is at distinguishing
between employees who stay and those who leave. An AUC of 1 indicates perfect classification,
while an AUC of 0.5 indicates no better than random guessing.
Visualization:
● If the AUC score is 0.89, this indicates the model does a great job distinguishing between
the two classes (employees who stay and employees who leave).
● The ROC curve should be close to the top-left corner of the plot, which would indicate high
sensitivity (TPR) and low false positive rate (FPR).
● The blue curve will show the performance of your SVM model.
● The red dashed line is the diagonal, which represents a random model.
Conclusion:
● The ROC Curve helps visualize how the model performs across all classification thresholds.
● The AUC Score provides an aggregate measure of the model's ability to distinguish between
classes (attrition vs non-attrition).
● A high AUC score and a curve that is close to the top-left corner indicate a good model.
By plotting the ROC curve and calculating the AUC score, we can assess how well the Support Vector
Machine (SVM) classifier is predicting employee attrition.
7. Coding (Screenshots)
Hospital Dataset
ANOVA Testing
Chi-Square
Data Modeling: Logistic Regression and Random Forest