R Programming 6 Feb
Reg.No- 221180
1.Ans:
> library(readr)
> data<-Attrition
> View(data)
> attach(data)
The following object is masked _by_ .GlobalEnv:
Attrition
> colSums(is.na(data)) # check for missing values, then omit them
Age Attrition BusinessTravel
0 0 0
DailyRate Department DistanceFromHome
0 0 0
Education EducationField EmployeeCount
0 0 0
EmployeeNumber EnvironmentSatisfaction Gender
0 0 0
HourlyRate JobInvolvement JobLevel
0 0 0
JobRole JobSatisfaction MaritalStatus
0 0 0
MonthlyIncome MonthlyRate NumCompaniesWorked
0 0 0
Over18 OverTime PercentSalaryHike
0 0 0
PerformanceRating RelationshipSatisfaction StandardHours
0 0 0
StockOptionLevel TotalWorkingYears TrainingTimesLastYear
0 0 0
WorkLifeBalance YearsAtCompany YearsInCurrentRole
0 0 0
YearsSinceLastPromotion YearsWithCurrManager
0 0
> #data<-na.omit(data)
> #plot(Age,Attrition) # plot dependent and independent variables
> library(caTools) # To use sample.split function
Warning message:
package ‘caTools’ was built under R version 4.3.2
> split <- sample.split(Attrition$Age, SplitRatio = 0.8, group = NULL) # split data
> training <- subset(data, split == TRUE) # Training data set 80% of data
> testing <- subset(data, split == FALSE) # Testing data set 20% of data
> dim(data)
[1] 1470 35
> dim(training) # dimension of training set
[1] 1172 35
> dim(testing) # dimension of testing set
[1] 298 35
> logistic <- glm(Attrition~MonthlyIncome+PercentSalaryHike+JobSatisfaction,
training, family = "binomial") # glm fits a generalised linear model; to change the model, change the family argument
Error in eval(family$initialize) : y values must be 0 <= y <= 1
> data$Attrition<-ifelse(data$Attrition=="Yes", 1, 0)
> View(data)
> training <- subset(data, split == TRUE) # Training data set 80% of data
> testing <- subset(data, split == FALSE) # Testing data set 20% of data
> View(training)
> logistic <- glm(Attrition~MonthlyIncome+PercentSalaryHike+JobSatisfaction,
training, family = "binomial") # glm fits a generalised linear model; to change the model, change the family argument
> # Attrition is the dependent variable; the others are independent variables
> # to implement logistic regression, set family to "binomial"
> summary(logistic)
Call:
glm(formula = Attrition ~ MonthlyIncome + PercentSalaryHike +
JobSatisfaction, family = "binomial", data = training)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.0625789 0.4196189 -0.149 0.881449
MonthlyIncome -0.0001155 0.0000233 -4.958 7.11e-07 ***
PercentSalaryHike -0.0145208 0.0227212 -0.639 0.522767
JobSatisfaction -0.2754739 0.0732322 -3.762 0.000169 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interpretation:
This is a logistic regression model that predicts employee attrition from Monthly Income,
Percent Salary Hike, and Job Satisfaction. Here is an interpretation of the key components:
Model Summary:
The logistic regression model is built using the formula Attrition ~ Monthly Income + Percent Salary
Hike + Job Satisfaction on the training dataset. The model summary provides coefficients for each
predictor variable.
Coefficients:
Intercept: -0.0625789; MonthlyIncome: -0.0001155; PercentSalaryHike: -0.0145208; JobSatisfaction: -0.2754739
Interpretation of Coefficients:
Monthly Income, Percent Salary Hike, and Job Satisfaction coefficients represent the change in the
log-odds of the dependent variable (Attrition) for a one-unit change in each predictor.
Negative coefficients suggest a decrease in the log-odds of Attrition for an increase in the
corresponding predictor.
Significance:
The p-values associated with each coefficient indicate how strong the evidence is that the
coefficient differs from zero. In this model, MonthlyIncome (p ≈ 7.11e-07) and
JobSatisfaction (p ≈ 0.000169) are significant predictors, while PercentSalaryHike
(p ≈ 0.52) is not.
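Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. A quick check using the estimates reported in the summary above:

```r
# Coefficients from the fitted model above (log-odds scale)
coefs <- c(MonthlyIncome = -0.0001155,
           PercentSalaryHike = -0.0145208,
           JobSatisfaction = -0.2754739)

# Odds ratios: the multiplicative change in the odds of attrition
# for a one-unit increase in each predictor
odds_ratios <- exp(coefs)
print(round(odds_ratios, 4))

# For example, exp(-0.2754739) is about 0.759, i.e. each one-point
# increase in JobSatisfaction cuts the odds of leaving by roughly 24%
```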
Model Accuracy:
After applying the model to the testing dataset, you calculated the accuracy using a threshold of 0.5.
The accuracy of your model is approximately 82.89%.
Prediction Probabilities:
The pred variable contains the predicted probabilities of Attrition being "Yes" for each observation in
the testing set.
Confusion Matrix:
The Matrix1 table represents a confusion matrix showing the comparison between the actual and
predicted values.
Accuracy Calculation:
The Accuracy variable holds the accuracy of your model, calculated as the ratio of correctly predicted
observations to the total observations.
In conclusion, your logistic regression model, based on the given predictors, seems to have decent
predictive performance with an accuracy of around 82.89%. You may further explore model
evaluation metrics and consider refining the model by adding or modifying predictors based on the
specific context and domain knowledge.
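The prediction and accuracy steps referred to above (the pred, Matrix1, and Accuracy variables) do not appear in the transcript; a minimal sketch of how they are typically computed, assuming the logistic model and the recoded testing set from this session, would be:

```r
# Predicted probabilities of Attrition == 1 on the held-out set
pred <- predict(logistic, newdata = testing, type = "response")

# Classify with a 0.5 threshold and compare to the actual labels
predicted_class <- ifelse(pred > 0.5, 1, 0)
Matrix1 <- table(Actual = testing$Attrition, Predicted = predicted_class)

# Accuracy: correctly classified observations / total observations
Accuracy <- sum(diag(Matrix1)) / sum(Matrix1)
Accuracy  # reported above as roughly 0.8289
```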
2.Ans:
library(tidyverse)
> library(caret)
> # Load the dataset
> your_dataset <- read.csv("C:/Users/ADMIN/Downloads/Attrition.csv") # Replace with your dataset file path
> # Data preprocessing and exploration
> selected_variables <- c("MonthlyIncome", "PercentSalaryHike", "JobSatisfaction")
> selected_data <- your_dataset %>%
+ select(all_of(selected_variables)) %>%
+ na.omit() # Remove rows with missing values
> # Split the dataset into training and testing sets
> set.seed(123)
> train_indices <- createDataPartition(selected_data$PercentSalaryHike, p = 0.8, list = FALSE)
> train_data <- selected_data[train_indices, ]
> test_data <- selected_data[-train_indices, ]
> # Fit a linear regression model
> model <- lm(PercentSalaryHike ~ MonthlyIncome + PercentSalaryHike + JobSatisfaction, data = train_data)
Warning messages:
1: In model.matrix.default(mt, mf, contrasts) :
the response appeared on the right-hand side and was dropped
2: In model.matrix.default(mt, mf, contrasts) :
problem with term 2 in model.matrix: no columns are assigned
> # Make predictions on the test set
> predictions <- predict(model, newdata = test_data)
Error in qr.Q(qr.default(tR, LAPACK = TRUE))[, (p + 1L):pp, drop = FALSE] :
subscript out of bounds
> # Fit a linear regression model
> model <- lm(PercentSalaryHike ~ MonthlyIncome + Age + JobSatisfaction, data = train_data)
Error in model.frame.default(formula = PercentSalaryHike ~ MonthlyIncome + :
variable lengths differ (found for 'Age')
> # Fit a linear regression model
> model <- lm(PercentSalaryHike ~ MonthlyIncome + JobSatisfaction, data = train_data)
> # Make predictions on the test set
> predictions <- predict(model, newdata = test_data)
> # Evaluate model performance (e.g., Mean Squared Error)
> mse <- mean((predictions - test_data$PercentSalaryHike)^2)
> cat("Mean Squared Error:", mse, "\n")
Mean Squared Error: 13.40516
> # Summary of the linear regression model
> summary(model)
Call:
lm(formula = PercentSalaryHike ~ MonthlyIncome + JobSatisfaction,
data = train_data)
Residuals:
Min 1Q Median 3Q Max
-4.362 -3.157 -1.220 2.700 9.977
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.532e+01 3.224e-01 47.504 <2e-16 ***
MonthlyIncome -2.409e-05 2.287e-05 -1.053 0.292
JobSatisfaction 1.796e-02 9.634e-02 0.186 0.852
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interpretation:
Let's interpret the results of the linear regression model:
Coefficients:
Intercept (15.32): The intercept represents the estimated average Percent Salary Hike when
all other predictors are zero. In this context, it might not have a meaningful interpretation.
Monthly Income (-2.409e-05): The coefficient for Monthly Income is approximately -2.409e-05.
This implies that, on average, a one-unit increase in Monthly Income is associated with a
decrease of about 2.4e-05 percentage points in Percent Salary Hike. However, since this
coefficient is very close to zero and not statistically significant (p-value = 0.292), Monthly
Income is not a significant predictor in this model. The Job Satisfaction coefficient
(1.796e-02) is similarly small and non-significant (p-value = 0.852).
Residuals:
The residuals represent the differences between the observed and predicted values of Percent Salary
Hike. The summary shows statistics such as the minimum, 1st quartile, median, 3rd quartile, and
maximum values of the residuals.
Residual Standard Error:
This is an estimate of the standard deviation of the residuals. It indicates the average
amount by which the observed values deviate from the predicted values.
Multiple and Adjusted R-squared:
These values measure the proportion of variance explained by the model. In this case, the low
R-squared values suggest that the model does not explain much of the variability in Percent
Salary Hike.
The F-statistic tests the overall significance of the model. The p-value associated with the F-statistic is
0.5629, suggesting that the model as a whole may not be statistically significant.
In summary, based on the provided results, it seems that neither Monthly Income nor Job
Satisfaction significantly predicts Percent Salary Hike in this model. The model's overall fit is not
statistically significant, and the predictors do not have a strong relationship with the
response variable.
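The residual standard error, R-squared values, and F-statistic p-value discussed above are not visible in the truncated summary output; assuming the fitted model object from this session, they can be extracted directly:

```r
s <- summary(model)

s$sigma          # residual standard error
s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared

# p-value of the overall F-test (reported above as about 0.5629)
f <- s$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
```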