R Programming - 6 Feb

The document discusses analyzing employee attrition data using logistic regression. It involves splitting the data into training and testing sets, building a logistic regression model on the training set with Attrition as the target and MonthlyIncome, PercentSalaryHike, and JobSatisfaction as predictors, and using the model to make predictions on the testing set. The summary of the logistic model shows that two of the three predictors are statistically significant in predicting Attrition.


Name- Sarita Kumari

Reg.No- 221180

1. Ans:
library(readr)
> data<-Attrition
> View(data)
> attach(data)
The following object is masked _by_ .GlobalEnv:

Attrition

> colSums(is.na(data)) # check for missing values, then omit them
Age Attrition BusinessTravel
0 0 0
DailyRate Department DistanceFromHome
0 0 0
Education EducationField EmployeeCount
0 0 0
EmployeeNumber EnvironmentSatisfaction Gender
0 0 0
HourlyRate JobInvolvement JobLevel
0 0 0
JobRole JobSatisfaction MaritalStatus
0 0 0
MonthlyIncome MonthlyRate NumCompaniesWorked
0 0 0
Over18 OverTime PercentSalaryHike
0 0 0
PerformanceRating RelationshipSatisfaction StandardHours
0 0 0
StockOptionLevel TotalWorkingYears TrainingTimesLastYear
0 0 0
WorkLifeBalance YearsAtCompany YearsInCurrentRole
0 0 0
YearsSinceLastPromotion YearsWithCurrManager
0 0
> #data<-na.omit(data)
> #plot(Age,Attrition) # plot dependent and independent variables
> library(caTools) # To use sample.split function
Warning message:
package ‘caTools’ was built under R version 4.3.2
> split <- sample.split(Attrition$Age, SplitRatio = 0.8, group = NULL) # split data
> training <- subset(data, split == TRUE) # Training data set 80% of data
> testing <- subset(data, split == FALSE) # Testing data set 20% of data
> dim(data)
[1] 1470 35
> dim(training) # dimension of training set
[1] 1172 35
> dim(testing) # dimension of testing set
[1] 298 35
> logistic<-glm(Attrition~MonthlyIncome+PercentSalaryHike+JobSatisfaction, training, family = "binomial") # glm() fits a generalised linear model; family = "binomial" gives logistic regression
Error in eval(family$initialize) : y values must be 0 <= y <= 1
> data$Attrition<-ifelse(data$Attrition=="Yes", 1, 0) # recode the target to 0/1 so glm() accepts it as a binomial response
> View(data)
> training <- subset(data, split == TRUE) # Training data set 80% of data
> testing <- subset(data, split == FALSE) # Testing data set 20% of data
> View(training)
> logistic<-glm(Attrition~MonthlyIncome+PercentSalaryHike+JobSatisfaction, training, family = "binomial")
> # to implement logistic regression, set family = "binomial"
> summary(logistic)
Call:
glm(formula = Attrition ~ MonthlyIncome + PercentSalaryHike +
JobSatisfaction, family = "binomial", data = training)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.0625789 0.4196189 -0.149 0.881449
MonthlyIncome -0.0001155 0.0000233 -4.958 7.11e-07 ***
PercentSalaryHike -0.0145208 0.0227212 -0.639 0.522767
JobSatisfaction -0.2754739 0.0732322 -3.762 0.000169 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1025.53  on 1171  degrees of freedom
Residual deviance:  980.18  on 1168  degrees of freedom
AIC: 988.18

Number of Fisher Scoring iterations: 5

> # To get a sigmoid curve fitting the data (commented-out example; creditdebt and default are not columns of this dataset)
> #library(ggplot2)
> #ggplot(training, aes(x = creditdebt, y = default)) +
> #  geom_point() +
> #  stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)
> # To get the predicted probabilities of the response variable
> pred<-predict(logistic, testing, type="response")
> pred
         1          2          3          4          5          6          7
0.20024488 0.15346914 0.17952894 0.17543210 0.16361258 0.16676479 0.31061505
         8          9         10         11         12         13         14
0.15354123 0.21967658 0.18614436 0.15861364 0.26511609 0.10273417 0.17464860
 [ ... output truncated: 298 predicted probabilities in total, all between
   roughly 0.03 and 0.33 ... ]
       295        296        297        298
0.03806454 0.08183898 0.13699932 0.15338186
> # table to analyse the accuracy of the model
> Matrix1<-(table(ActualValue=testing$Attrition, PredictedValue=pred>0.5))
> Accuracy<-sum(diag(Matrix1))/sum(Matrix1)
> Accuracy
[1] 0.8288591

Interpretation:
This is a logistic regression model for predicting employee attrition from Monthly Income, Percent Salary Hike, and Job Satisfaction. Here is an interpretation of the key components:

Model Summary:

The logistic regression model is built using the formula Attrition ~ Monthly Income + Percent Salary
Hike + Job Satisfaction on the training dataset. The model summary provides coefficients for each
predictor variable.

Coefficients:

Intercept: -0.0625789

Monthly Income: -0.0001155 (negative coefficient)

Percent Salary Hike: -0.0145208 (negative coefficient)

Job Satisfaction: -0.2754739 (negative coefficient)

Interpretation of Coefficients:

The intercept is the estimated log-odds of attrition when all predictors are zero.

Monthly Income, Percent Salary Hike, and Job Satisfaction coefficients represent the change in the
log-odds of the dependent variable (Attrition) for a one-unit change in each predictor.

Negative coefficients suggest a decrease in the log-odds of Attrition for an increase in the
corresponding predictor.
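
Because log-odds are hard to interpret directly, the coefficients are often exponentiated into odds ratios. A minimal sketch, assuming the logistic object fitted in the session above:

exp(coef(logistic))   # odds ratios; values below 1 mean higher predictor values lower the odds of attrition
exp(cbind(OR = coef(logistic), confint.default(logistic)))   # with approximate (Wald) 95% confidence intervals

For example, exp(-0.2755) is about 0.76, so each one-point increase in JobSatisfaction multiplies the odds of attrition by roughly 0.76.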

Significance:

The p-values associated with each coefficient indicate their significance. In this model, Monthly Income (p ≈ 7.1e-07) and Job Satisfaction (p ≈ 0.00017) are statistically significant predictors of Attrition, while Percent Salary Hike (p ≈ 0.52) is not.

Model Accuracy:

After applying the model to the testing dataset, accuracy was calculated using a threshold of 0.5, giving approximately 82.89%. Note, however, that the largest predicted probability is only about 0.33, so at this threshold every observation is classified as non-attrition, and the accuracy simply reflects the share of non-attrition cases in the test set.

Prediction Probabilities:

The pred variable contains the predicted probabilities of Attrition being "Yes" for each observation in
the testing set.

Confusion Matrix:

The Matrix1 table represents a confusion matrix showing the comparison between the actual and
predicted values.

Accuracy Calculation:

The Accuracy variable holds the accuracy of your model, calculated as the ratio of correctly predicted
observations to the total observations.
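
Accuracy alone can be misleading here, so sensitivity and specificity are worth reading off the confusion matrix as well. A minimal sketch using the pred and testing objects from the session above; because no predicted probability exceeds roughly 0.33, fixed factor levels are used so the table keeps its 2x2 shape even when a class is never predicted:

predClass <- factor(pred > 0.5, levels = c(FALSE, TRUE))   # predicted class at the 0.5 threshold
actClass  <- factor(testing$Attrition, levels = c(0, 1))   # actual 0/1 labels
Matrix2   <- table(ActualValue = actClass, PredictedValue = predClass)

TN <- Matrix2[1, 1]; FP <- Matrix2[1, 2]
FN <- Matrix2[2, 1]; TP <- Matrix2[2, 2]

TP / (TP + FN)             # sensitivity (true positive rate): 0 at this threshold
TN / (TN + FP)             # specificity (true negative rate): 1 at this threshold
(TP + TN) / sum(Matrix2)   # accuracy, matching the 0.8289 above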

In conclusion, the logistic regression model reports an accuracy of around 82.89%, but given the caveat above about the 0.5 threshold, this figure should not be read as strong predictive performance on its own. It is worth exploring further evaluation metrics (see the ROC sketch below) and refining the model by adding or modifying predictors based on the specific context and domain knowledge.
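
One way to go beyond raw accuracy is a threshold-free measure such as ROC/AUC. A minimal sketch, assuming the pROC package is installed (it is not used elsewhere in this document):

library(pROC)                            # install.packages("pROC") if not available
roc_obj <- roc(testing$Attrition, pred)  # actual 0/1 labels vs predicted probabilities
auc(roc_obj)                             # area under the ROC curve
plot(roc_obj)                            # sensitivity/specificity trade-off across all thresholds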
2. Ans:

library(tidyverse)
> library(caret)
> # Load the dataset
> your_dataset <- read.csv("C:/Users/ADMIN/Downloads/Attrition.csv") # Replace with your dataset file path
> # Data preprocessing and exploration
> selected_variables <- c("MonthlyIncome", "PercentSalaryHike", "JobSatisfaction", "PercentSalaryHike") # note: "PercentSalaryHike" is listed twice
> selected_data <- your_dataset %>%
+   select(all_of(selected_variables)) %>%
+   na.omit() # Remove rows with missing values
> # Split the dataset into training and testing sets
> set.seed(123)
> train_indices <- createDataPartition(selected_data$PercentSalaryHike, p = 0.8, list = FALSE)
> train_data <- selected_data[train_indices, ]
> test_data <- selected_data[-train_indices, ]
> # Fit a linear regression model
> model <- lm(PercentSalaryHike ~ MonthlyIncome + PercentSalaryHike + JobSatisfaction, data = train_data)
Warning messages:
1: In model.matrix.default(mt, mf, contrasts) :
  the response appeared on the right-hand side and was dropped
2: In model.matrix.default(mt, mf, contrasts) :
  problem with term 2 in model.matrix: no columns are assigned
> # Make predictions on the test set
> predictions <- predict(model, newdata = test_data)
Error in qr.Q(qr.default(tR, LAPACK = TRUE))[, (p + 1L):pp, drop = FALSE] :
  subscript out of bounds
> # Fit a linear regression model (fails: Age was never put into selected_data,
> # so R falls back to the full-length Age attached in question 1)
> model <- lm(PercentSalaryHike ~ MonthlyIncome + Age + JobSatisfaction, data = train_data)
Error in model.frame.default(formula = PercentSalaryHike ~ MonthlyIncome + :
  variable lengths differ (found for 'Age')
> # Fit a linear regression model (corrected: the response no longer appears on the right-hand side)
> model <- lm(PercentSalaryHike ~ MonthlyIncome + JobSatisfaction, data = train_data)
> # Make predictions on the test set
> predictions <- predict(model, newdata = test_data)
> # Evaluate model performance (e.g., Mean Squared Error)
> mse <- mean((predictions - test_data$PercentSalaryHike)^2)
> cat("Mean Squared Error:", mse, "\n")
Mean Squared Error: 13.40516
> # Summary of the linear regression model
> summary(model)

Call:
lm(formula = PercentSalaryHike ~ MonthlyIncome + JobSatisfaction,
data = train_data)

Residuals:
Min 1Q Median 3Q Max
-4.362 -3.157 -1.220 2.700 9.977

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.532e+01 3.224e-01 47.504 <2e-16 ***
MonthlyIncome -2.409e-05 2.287e-05 -1.053 0.292
JobSatisfaction 1.796e-02 9.634e-02 0.186 0.852
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.661 on 1174 degrees of freedom
Multiple R-squared:  0.0009784,  Adjusted R-squared:  -0.0007235
F-statistic: 0.5749 on 2 and 1174 DF,  p-value: 0.5629

Interpretation:
Let's interpret the results of the linear regression model:

Coefficients:

Intercept (15.32): The intercept represents the estimated average Percent Salary Hike when all predictors are zero. In this context, it might not have a meaningful interpretation (no employee has zero monthly income).

Monthly Income (-2.409e-05): This implies that, on average, a one-unit increase in Monthly Income is associated with a decrease of about 2.409e-05 units in Percent Salary Hike. Since this coefficient is very close to zero and not statistically significant (p-value = 0.292), Monthly Income does not appear to be a meaningful predictor in this model.

Job Satisfaction (0.01796): This implies that, on average, a one-unit increase in Job Satisfaction is associated with an increase of 0.01796 units in Percent Salary Hike. However, since this coefficient is not statistically significant (p-value = 0.852), Job Satisfaction may not be a significant predictor in this model.
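
To make these coefficients concrete, the sketch below predicts the salary hike for a hypothetical employee; the input values are invented purely for illustration:

new_emp <- data.frame(MonthlyIncome = 5000, JobSatisfaction = 3)   # hypothetical values
predict(model, newdata = new_emp)   # about 15.25, barely different from the intercept

Because both slopes are tiny, predictions hardly move across realistic input ranges, which is consistent with the near-zero R-squared discussed below.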

Residuals:

The residuals represent the differences between the observed and predicted values of Percent Salary
Hike. The summary shows statistics such as the minimum, 1st quartile, median, 3rd quartile, and
maximum values of the residuals.

Residual standard error (3.661):

This is an estimate of the standard deviation of the residuals. It indicates the average amount by
which the observed values deviate from the predicted values.
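
As a quick cross-check, the test-set MSE computed earlier can be put on the same scale as this residual standard error by taking its square root:

sqrt(mse)   # test-set RMSE, about 3.66, close to the in-sample residual standard error of 3.661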

Multiple R-squared (0.0009784) and Adjusted R-squared (-0.0007235):

These values measure the proportion of variance explained by the model. In this case, the low R-
squared values suggest that the model does not explain much of the variability in Percent Salary
Hike.
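
The reported R-squared can be reproduced by hand as a sanity check, using the model and train_data objects from the session above:

ss_res <- sum(residuals(model)^2)   # residual sum of squares
ss_tot <- sum((train_data$PercentSalaryHike - mean(train_data$PercentSalaryHike))^2)   # total sum of squares
1 - ss_res / ss_tot   # should match Multiple R-squared, about 0.00098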

F-statistic (0.5749) and p-value (0.5629):

The F-statistic tests the overall significance of the model. The p-value associated with the F-statistic is
0.5629, suggesting that the model as a whole may not be statistically significant.
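
The overall F-test can likewise be reproduced by comparing the fitted model against an intercept-only baseline (a sketch):

null_model <- lm(PercentSalaryHike ~ 1, data = train_data)   # intercept-only baseline
anova(null_model, model)   # reproduces F = 0.5749 on 2 and 1174 DF, p = 0.5629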
In summary, based on the provided results, it seems that neither Monthly Income nor Job
Satisfaction significantly predicts Percent Salary Hike in this model. The model's overall fit is not
statistically significant, and the predictors do not have a strong relationship with the
response variable.
