
Technical Internship Report - HR Dataset

This document discusses exploratory data analysis (EDA) on employee attrition using Python. It begins with an introduction to data analytics, EDA, and the steps involved. It then provides details about the company where the internship is taking place, Teklink Software Pvt. Ltd., including their services. The document outlines the project which will analyze a dataset on employee attrition using EDA techniques in Python like Pandas, Numpy, and Matplotlib to understand the key factors affecting attrition. The results will help the management make decisions to reduce attrition.


SVKM’s NMIMS

Mukesh Patel School of


Technology Management and
Engineering

Technical Internship

Interim Report on

EDA on Employee
Attrition In Python

At:

By:

Ritisha Gupta

Roll Number: I209

SAP ID: 70411119011

Faculty Mentor:

Prof. Prashant Udawant

Industry Mentor:

Mr. Sanjeet Porwal


TABLE OF CONTENTS
1. Acknowledgement
2. Abstract
3. Table of Contents
4. About the Company
5. Introduction
5.1 Data Analysis
5.1.1 Exploratory Data Analysis
5.1.2 Techniques of EDA
5.2 Python
5.2.1 Pandas
5.2.2 NumPy
5.2.3 Matplotlib
5.2.4 Seaborn
5.3 Statistics
5.3.1 Descriptive Analysis
5.3.2 Normal Distribution
5.3.3 Hypothesis
6. Project - EDA on Employee Attrition
7. Conclusion
8. References

Acknowledgement

The Technical Internship at Teklink Pvt. Ltd has been a professionally satisfying experience. This
opportunity helped me expand my knowledge base and learn about new technologies and skills as
well as their implementation. It enabled me to enhance my technical skill set by learning and
working in a dynamic role. It facilitated interaction with individuals at the organization and a deep
understanding about professionalism, work culture and work ethics.

I would begin by expressing my sincerest gratitude to Mr. Pankaj Gupta, the head of the Teklink
Indore office, for providing me with the opportunity to experience the work culture at this
organization by enabling me to work on a project of my interest that also meets the requirements of
our university.

I would like to extend my sincere gratitude to my industry mentor, Mr. Sanjeet Porwal,
Associate Consultant at Teklink, for placing his trust and confidence in me, for providing this
opportunity, for guiding me throughout the project period and for helping me undergo training at the
organization. This internship has been a stepping stone towards working with highly skilled
professionals and executives.

I would like to thank my faculty mentor, Prof. Prashant Udawant, for his continued support at all
stages and for providing me with the necessary guidance.

I would like to acknowledge the backing and guidance of all the professors and staff members of
NMIMS’ MPSTME for including this TIP in the curriculum so that students could get a
professional experience of working at an organization and gain industry experience.

Once again, my hearty thanks to everyone at Teklink Pvt. Ltd for having made the internship
program a fruitful one. Last but not least, I would end by thanking my parents and the entire
team of Teklink for acting as a constant source of inspiration and for motivating me.

Abstract

Everything today runs on data, from social media to large companies. The
term data refers to information about anything. Each company and each
institution has a set of data that it has collected and maintained over a
period of time. This data is analyzed to evaluate and improve the growth of
the company. Hence businesses use data analytics, which is the process of
exploring and analyzing large datasets to find hidden patterns and unseen
trends, discover correlations, and derive valuable insights to make business
predictions. It improves the speed and efficiency of a business.
Employee attrition can become a serious issue because of the impacts on the
organization’s competitive advantage. It can become costly for an organization.
The cost of employee attrition would be the cost related to the human resources
life cycle, lost knowledge, employee morale, and organizational culture.

This project aims to analyse employee attrition using Exploratory Data
Analysis in Python. The results can be used by the management to
understand what changes they should make in the workplace to get most
of their workers to stay. The results of the project contain information on the key
factors that affect attrition in the organisation. The conclusions can be
taken into consideration for effective decision making in the Human Resources
department.

About The Company
TekLink Software Pvt. Ltd is headquartered in Warrenville, IL, USA, and has
offices across the globe in Europe and India. TekLink is a leading consulting
partner in Business Intelligence, Data and Analytics, Planning, Forecasting and
IaaS solutions for many Fortune 500 companies. Beyond strategic consulting,
TekLink is a full-service Planning and Analytics service provider, including
Design, Implementation and Application Management Services. TekLink has
established business, technical and functional expertise across multiple
industries, including consumer products, pharmaceuticals, manufacturing &
distribution, retail, utilities and high tech. TekLink has partnerships with SAP,
Microsoft, Snowflake, Tableau and others. Its Global Development Model is supported
by an offshore delivery center, enabling it to provide flexible and cost-effective
solutions. TekLink has 350 employees across the globe.
The services provided by TekLink are:
● Data Science and Advanced Analytics
● Enterprise Data Lake and Data Warehouse
● Data Engineering and Data Integration
● Data Visualization and Storytelling
● DevOps for Analytics
● Cloud Infrastructure and Migration Services
● Extended Planning and consolidation
● Trade Management Applications
● Supply Chain Management
● Technical Services
● Innovation Labs
● Roadmaps and Advisory Services

Introduction

Data Analytics
Companies around the globe generate vast volumes of data daily, in the form of log files,
web server data, transactional data, and various customer-related data. In addition to this, social
media websites also generate enormous amounts of data.
Companies ideally need to use all of their generated data to derive value from it and make
impactful business decisions. Data analytics serves this purpose.
Data analytics is the process of exploring and analyzing large datasets to find hidden
patterns and unseen trends, discover correlations, and derive valuable insights to make business
predictions. It improves the speed and efficiency of a business.
Businesses use many modern tools and technologies to perform data analytics.

Data analysis covers a wide range of approaches and techniques under different
names, such as prescriptive analysis, predictive analysis, diagnostic analysis, statistical
analysis, and text analysis.
In statistical applications, data analysis incorporates two key concepts: CDA (confirmatory
data analysis) and EDA (exploratory data analysis). While CDA emphasizes falsifying or
confirming existing hypotheses, EDA focuses on exploring the data and identifying new
features.
Data scientists use exploratory data analysis tools and techniques to investigate,
analyze, and summarize the main characteristics of datasets, often using data visualization
methods.

The steps of data analytics are:

1. Understand the business problem
Understanding the business problem, defining the organizational goals, and planning a
lucrative solution is the first step in the analytics process.
2. Data collection
Next, you need to collect transactional business data and customer-related information from
the past few years to address the problems your business is facing.
3. Data cleaning
You need to clean the data by removing unwanted and redundant values and handling missing
values to make it ready for analysis.
4. Data exploration and analysis
After you gather the right data, the next vital step is to execute exploratory data analysis.
You can use data visualization and business intelligence tools, data mining techniques, and
predictive modeling to analyze, visualize, and predict future outcomes from this data.
5. Interpreting the results
The final step is to interpret the results and validate whether the outcomes meet your expectations.

Exploratory Data Analysis


EDA techniques allow for effective manipulation of data sources, enabling data
scientists to find the answers they need by discovering data patterns, spotting
anomalies, checking assumptions, or testing a hypothesis.
Data specialists primarily use exploratory data analysis to discern what datasets
can reveal beyond formal modeling or hypothesis testing tasks.
This enables them to gain in-depth knowledge of the variables in datasets and
their relationships.
Exploratory data analysis can help detect obvious errors, identify outliers in
datasets, understand relationships, unearth important factors, find patterns within
data, and provide new insights.
Exploratory data analysis can be performed using Python or R. BI tools such as
Tableau and Power BI can also be used for effective exploratory data analysis.
In Python, it can be implemented using libraries such as Seaborn and Matplotlib.
The steps that analysts keep in mind when performing EDA include:
● Asking the right questions related to the purpose of data analysis
● Obtaining in-depth knowledge about problem domains
● Setting clear objectives that are aligned with the desired outcomes.

There are four exploratory techniques that are mainly used:

1. Univariate Non-Graphical
This is the simplest type of EDA, where the data has a single variable. Since
there is only one variable, data professionals do not have to deal with
relationships.

2. Univariate Graphical
Non-graphical techniques do not present the complete picture of the data.
Therefore, for comprehensive EDA, data specialists use graphical
methods, such as stem-and-leaf plots, box plots, and histograms.

3. Multivariate Non-Graphical
Multivariate data consists of several variables. Non-graphical multivariate
EDA methods illustrate relationships between 2 or more variables
using statistics or cross-tabulation.

4. Multivariate Graphical
This EDA technique makes use of graphics to show relationships between 2
or more variables. Widely used multivariate graphics include the bar chart,
heat map, bubble chart, run chart, multivariate chart, and scatter
plot.
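
The two non-graphical techniques above can be sketched in pandas as follows. This is only a minimal illustration: the small table is invented for the example (it is not the internship dataset) and the column names are assumptions.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "Age": [25, 32, 41, 29, 38, 45],
    "OverTime": ["Yes", "No", "No", "Yes", "No", "Yes"],
    "Attrition": ["Yes", "No", "No", "Yes", "No", "No"],
})

# Univariate non-graphical: summarise one variable at a time.
age_summary = df["Age"].describe()
overtime_counts = df["OverTime"].value_counts()

# Multivariate non-graphical: cross-tabulate two categorical variables.
overtime_vs_attrition = pd.crosstab(df["OverTime"], df["Attrition"])

print(age_summary)
print(overtime_counts)
print(overtime_vs_attrition)
```

The graphical techniques plot the same quantities, for example a histogram of a single column or a grouped bar chart of the cross-tabulation.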

Employee Attrition

Employee attrition is the gradual reduction in staff numbers that occurs as
employees retire or resign and are not replaced. Employee attrition can be costly
for businesses. The company loses employee productivity and employee
knowledge.
“Turnover / Churn” and “Attrition” are human resource terms that are often
confused. Employee turnover and attrition both occur when an employee leaves
the company. Turnover, however, results from several different actions such as
discharge, termination, resignation or abandonment. Attrition occurs when an
employee retires or when the employer eliminates the position. The big
difference between the two is that when turnover occurs, the company seeks
someone to replace the employee. In the case of attrition, the employer
leaves the vacancy unfilled or eliminates the job role.
In this project, as a Data Analyst intern, I performed basic EDA on an HR
dataset. It yields observations about the key factors that affect attrition,
which can be helpful in effective decision making.

Python

Python is an object-oriented, high-level programming language. Python has a number
of distinguishing characteristics that make it well suited to data analysis: it is
easy to learn and flexible, and it has a huge collection of libraries, good support
for graphics and visualization, and ready-made data analysis tools.
There are various libraries in Python, but I have mainly used four:

Pandas
Pandas is a Python library used for working with data sets. It has functions for
analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".
The library was created by Wes McKinney in 2008.

Pandas allows us to analyze large data sets and draw conclusions based on statistical
theories. Pandas can clean messy data sets and make them readable and
relevant. Relevant data is very important in data science.
Pandas gives you answers about the data, such as the correlation between two features, the
average value of a feature, and the maximum and minimum values of a feature. Pandas can
also delete rows that are not relevant or that contain wrong values, like empty or
NULL values. This is called cleaning the data.
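
A minimal sketch of these operations is given below. The data is invented for illustration and the column names are assumptions, not taken from the project dataset.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value.
df = pd.DataFrame({
    "Age": [25, 32, None, 29, 38],
    "MonthlyIncome": [3000, 5200, 6100, 3500, 7000],
})

# Cleaning: drop rows that contain empty (NaN) values.
clean = df.dropna()

# Answers about the data: average, minimum, maximum and correlation.
mean_income = clean["MonthlyIncome"].mean()
min_age = clean["Age"].min()
max_age = clean["Age"].max()
age_income_corr = clean["Age"].corr(clean["MonthlyIncome"])

print(mean_income, min_age, max_age, age_income_corr)
```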

NumPy
NumPy is a Python library used for working with arrays. It also has functions for
working in the domains of linear algebra, Fourier transforms, and matrices. NumPy was
created in 2005 by Travis Oliphant. It is an open source project and you can use it
freely. NumPy stands for Numerical Python.
In Python we have lists that serve the purpose of arrays, but they are slow to
process. NumPy aims to provide an array object that is up to 50x faster than traditional
Python lists. The array object in NumPy is called ndarray, and it provides a lot of
supporting functions that make working with ndarray very easy. Arrays are very
frequently used in data science, where speed and resources are very important.
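
The difference between a list and an ndarray can be sketched as below. The values are invented for illustration, and no timing is measured here, so the sketch shows only the style of working with arrays, not the speed-up.

```python
import numpy as np

incomes_list = [3000, 5200, 6100, 3500, 7000]  # invented values

# With a plain list, arithmetic needs an explicit loop.
raised_list = [x * 1.1 for x in incomes_list]

# With an ndarray, the same operation applies to every element at once.
incomes = np.array(incomes_list)
raised = incomes * 1.1

print(type(incomes).__name__)
print(incomes.shape)
print(raised.mean())
```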

Matplotlib
Matplotlib is a visualization library in Python for 2D plots of arrays.
It is a multi-platform data visualization library built on NumPy arrays and
designed to work with the broader SciPy stack. It was introduced by John Hunter in
the year 2002.
One of the greatest benefits of visualization is that it gives us visual access to huge
amounts of data in an easily digestible form. Matplotlib supports several plot types such as
line, bar, scatter and histogram.
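
The four plot types named above can be sketched as follows. The numbers are invented for illustration, and the "Agg" backend is chosen only so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# Invented values, for illustration only.
ages = [22, 25, 29, 31, 34, 35, 35, 38, 41, 47]
incomes = [2500, 2900, 3400, 4100, 4800, 5000, 5300, 6100, 7200, 9000]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].plot(ages, incomes)
axes[0, 0].set_title("Line")
axes[0, 1].bar(["Yes", "No"], [3, 7])
axes[0, 1].set_title("Bar")
axes[1, 0].scatter(ages, incomes)
axes[1, 0].set_title("Scatter")
axes[1, 1].hist(ages, bins=5)
axes[1, 1].set_title("Histogram")
fig.tight_layout()
# fig.savefig("plot_types.png") would write the figure to disk.
```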

Seaborn
Seaborn is a visualization library for statistical graphics plotting in Python.
It provides attractive default styles and color palettes for statistical plots.
It is built on top of the matplotlib library and is closely integrated with the
data structures from pandas.
Seaborn aims to make visualization the central part of exploring and understanding
data. It provides dataset-oriented APIs, so that we can switch between different visual
representations of the same variables for a better understanding of the dataset.
In my project I have mainly used seaborn for univariate and bivariate graphical analysis.
The plots I used most are countplot, catplot, barplot, strip plot and histogram.
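
A univariate and a bivariate count plot can be sketched with seaborn as below. The data frame is invented for illustration and its column names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "Attrition": ["Yes", "No", "No", "Yes", "No", "No"],
})

fig, (ax_uni, ax_bi) = plt.subplots(1, 2, figsize=(9, 4))

# Univariate: number of rows in each category of one variable.
sns.countplot(data=df, x="Gender", ax=ax_uni)

# Bivariate: the same counts, split by a second variable.
sns.countplot(data=df, x="Gender", hue="Attrition", ax=ax_bi)

fig.tight_layout()
```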

Statistics
Statistics incorporates data acquisition, data interpretation, and data validation.
Statistical data analysis is the approach of conducting various statistical operations, i.e.
thorough quantitative research that attempts to quantify data and employs some form
of statistical analysis.
Data comprises variables and is either univariate or multivariate. Depending
on the number of variables, experts apply different statistical techniques. If the data
has a single variable, then univariate statistical data analysis can be conducted,
including the t-test for significance, z-test, F-test, one-way ANOVA test, etc.
Data is of two types, continuous data and discrete data. Continuous data cannot be
counted, only measured, and can take any value within a range, e.g. the intensity of
light or the temperature of a room. Discrete data can be counted.
Data can be either quantitative or qualitative.
Under statistical data analysis, the distribution of continuous data is described by a
Probability Density Function and that of discrete data by a Probability Mass Function.
Various software programs are available to perform statistical data analysis, including
Statistical Analysis System (SAS) and many more.
There are two types of widely used statistical methods:

1. Descriptive Statistics
Descriptive statistics describes the variables in a sample or population and gives a
summary in the form of measures such as the mean, median and mode.

2. Inferential Statistics
This method is used for drawing conclusions about a population from a data sample, which
is subject to random variation, by testing null and alternative hypotheses.
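
The descriptive measures named above can be computed with Python's standard `statistics` module. The values below are invented for illustration.

```python
import statistics

ages = [18, 20, 22, 22, 23, 24, 24, 24]  # invented values

mean_age = statistics.mean(ages)      # arithmetic average
median_age = statistics.median(ages)  # middle value of the sorted data
mode_age = statistics.mode(ages)      # most frequent value

print(mean_age, median_age, mode_age)
```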

Sample dataset work
In the first four weeks I learned about:
● Python and its basic functionality, as well as the company and its services.
● The Python libraries NumPy, Pandas, Seaborn and Matplotlib.
● Applying library functions for data cleaning and data visualization on a
sample dataset.
● EDA and statistics, along with an initial implementation on the final dataset.

Sample dataset - Student Mental Health

It was a small dataset that mainly recorded whether students have
depression, anxiety or panic attacks and, if so, whether they are
seeking specialist treatment. Some key analyses from the dataset are:

Here you can see that the columns with the object data type are categorical variables.

Here we can see that the dataset contains 101 rows and 10 columns. Age is
the only numerical variable, with a count of 100, a mean of 20.53, a minimum
value of 18 and a maximum value of 24.

In the first step we checked for null values and found one in the age
column. To remove the null value the dropna() command is used, where axis=0
means that the complete row containing the null value is removed.
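
The null check and the effect of the axis argument can be sketched as follows. The small table is invented for illustration and is not the student dataset.

```python
import pandas as pd

# Hypothetical toy data with one missing age.
df = pd.DataFrame({
    "Age": [18.0, 21.0, None, 24.0],
    "Gender": ["Female", "Male", "Female", "Female"],
})

# Count the null values in each column.
null_counts = df.isna().sum()

# axis=0 drops every row that contains a null value.
rows_dropped = df.dropna(axis=0)

# axis=1 would instead drop every column that contains a null value.
cols_dropped = df.dropna(axis=1)

print(null_counts)
print(rows_dropped.shape, cols_dropped.shape)
```

Dropping the row keeps the age column available for analysis, which is why axis=0 fits a case with a single missing value.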

We can see the gender count, where females outnumber males. Through the
seaborn library we can visualize the male and female counts.

This graph depicts how many males and females have depression. We can
see that more females than males have depression.

Through this graph we can see that a lot of students do not consult a
specialist, which suggests a lack of focus on mental health.

We have students across the age range 18-24, although students aged 20-21 are
not present.

Main project
Performing EDA to find employee attrition factors.
My final dataset is an HR dataset on which I performed Exploratory Data Analysis to find the
major factors that contribute to employee attrition.

At first I imported the libraries that I used throughout the project:
mainly Pandas, NumPy, Matplotlib and Seaborn.
After that I loaded the dataset using the pandas.read function, and with data.head we
viewed the first 10 rows of the dataset.

The dataset contains 1470 rows (entries) and 35 columns (features).
data.dtypes shows the data type of each variable. As you can see there are 2 types of
data: int64, which is numerical (integer) data, and object, which means categorical
variables.

Here I have used the .isna().sum() command, which gives the feature-wise count of null
values. As observed, there are no null values.

With the nunique command we can get the number of unique values of each feature.
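
The inspection steps above can be sketched as follows. Since the real file is not reproduced here, a few invented rows in CSV form stand in for it, and `read_csv` is assumed to be the read function used.

```python
import io
import pandas as pd

# Stand-in for the real file: a few invented rows in CSV form.
csv_text = """Age,Attrition,Department,MonthlyIncome
28,Yes,Sales,3100
45,No,Research & Development,7400
36,No,Research & Development,5200
31,Yes,Human Resources,2800
"""

data = pd.read_csv(io.StringIO(csv_text))

print(data.head(10))      # first rows (at most 10)
print(data.shape)         # (number of rows, number of columns)
print(data.dtypes)        # data type of each column
print(data.isna().sum())  # null values per column
print(data.nunique())     # number of distinct values per column
```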

The dataset contained many unwanted or irrelevant features, so I removed them. After
this, 24 features remain in total.
The list of features is:
1. Attrition: It has 2 values, yes or no, which indicate whether an employee is still in the
organization or has left it. It is a categorical variable.
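
Removing the irrelevant features mentioned above could look like the sketch below. The column names here are hypothetical examples chosen for illustration; the report does not state which columns were dropped.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions.
data = pd.DataFrame({
    "Age": [30, 42],
    "Attrition": ["No", "Yes"],
    "EmployeeCount": [1, 1],       # same value in every row, so uninformative
    "EmployeeNumber": [101, 102],  # an identifier, not a feature
})

unwanted = ["EmployeeCount", "EmployeeNumber"]
data = data.drop(columns=unwanted)

print(data.columns.tolist())
```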

This is the description of the numerical and the categorical variables. For the numerical data we have
the total count, the mean (average), the standard deviation, and the minimum and maximum values.
The quartile values (25%, 50% and 75%) are also given. The 50% value is the median.
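
A minimal sketch of such a description is given below, using invented values and assumed column names.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
data = pd.DataFrame({
    "Age": [22, 30, 35, 35, 48],
    "Department": ["Sales", "R&D", "R&D", "HR", "R&D"],
})

# Numerical columns: count, mean, std, min, quartiles and max.
numeric_summary = data.describe()

# A categorical column: count, number of unique values, most frequent
# value (top) and how often it occurs (freq).
categorical_summary = data["Department"].describe()

print(numeric_summary)
print(categorical_summary)
```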

I have taken the top 7 factors that might be affecting the attrition rate the most.

Here it was seen that the attrition in the organization is 16%, i.e. 16% of the total recruited
employees have left the organization.
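
An attrition percentage of this kind can be computed as sketched below. The ten labels are invented, so the resulting figure differs from the 16% observed in the project.

```python
import pandas as pd

# Invented labels: 2 of 10 employees have left.
attrition = pd.Series(
    ["No", "No", "Yes", "No", "No", "No", "Yes", "No", "No", "No"],
    name="Attrition",
)

# normalize=True returns proportions instead of raw counts.
shares = attrition.value_counts(normalize=True)
attrition_pct = shares["Yes"] * 100

print(shares)
print(attrition_pct)
```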

The age feature has been taken for analysis; the average age is 36 years and the
modal age is 35 years.
Most employees of the company are in the age range of 30 to 43
years.
The overall range is 18 to 60, i.e. there are employees as young as 18 years and
as old as 60 years.

This is a histogram of the age variable. The age of most employees is
between 27 and 40. The curve peaks in the middle, so I
checked whether it follows a normal distribution.

Here I have checked the standard deviation through numpy, as well as the
skewness and kurtosis. Both show a slight deviation from a
normal distribution, but the age variable is treated as approximately normally distributed.
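
These checks can be sketched as follows, with invented ages standing in for the real column.

```python
import numpy as np
import pandas as pd

# Invented ages, for illustration only.
ages = pd.Series([18, 25, 29, 31, 34, 35, 35, 36, 38, 42, 47, 60])

# np.std on an array gives the population standard deviation (ddof=0);
# pandas' own .std() defaults to the sample version (ddof=1).
std_dev = np.std(ages.to_numpy())

skewness = ages.skew()  # 0 for a perfectly symmetric distribution
kurt = ages.kurt()      # excess kurtosis: 0 for a normal distribution

print(std_dev, skewness, kurt)
```

Values of skewness and excess kurtosis close to 0 are consistent with a roughly normal shape, which is the informal check described above.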

Middle-aged employees, from 26 to 46, have been rated high in
performance rating.

As we can observe, the younger and middle-aged age groups have lower income and
higher attrition.

Employees who work overtime are leaving the organization at a higher rate.
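
A comparison of this kind can be made numerically with a row-normalized cross-tabulation, as sketched below. The rows are invented for illustration and the column names are assumptions.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "OverTime":  ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"],
    "Attrition": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No"],
})

# normalize="index" turns each row into proportions, so the "Yes" column
# is the attrition rate within each overtime group.
rates = pd.crosstab(df["OverTime"], df["Attrition"], normalize="index")

print(rates)
```

Rates are compared rather than raw counts because the two overtime groups usually differ in size.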

Monthly Income

There is no impact of business travel on attrition.

Observations
Age
● The age feature has been taken for analysis; the average age is 36 years and the
modal age is 35 years.
● Most employees of the company are in the age range of 30 to 43 years.
● The curve peaks in the middle, so I checked whether it follows a
normal distribution.
● The overall range is 18 to 60, i.e. there are employees as young as 18 years and as
old as 60 years.
● Middle-aged employees, from 26 to 46, have been rated high in
performance rating.
● As we can observe, the younger and middle-aged age groups have
lower income and higher attrition.
Overtime
● Employees who work overtime are leaving at a higher rate.
● The employees are not getting paid for the overtime and hence they are leaving.
● The employees who work overtime have low salaries, hence they are leaving the
company and increasing the attrition rate. Overall, overtime has an impact on
attrition.
Monthly Income
● Through this description we can see that the average monthly income of the employees is $6502.
● The median salary, i.e. $4919, is lower than the average (mean) salary.
● The employees receiving low income are leaving the organisation.
● Performance rating has no effect on income.
● Income increases with job level.
● Employees have high job satisfaction but are still leaving due to low income.
● Employees have good work-life balance but are leaving the organization due to low income.
Business Travel
● There is no impact of business travel on attrition.

● It is seen that the R&D department requires more travel.

Gender
● 60% of the employees are male and 40% are female.
● Attrition seems to be higher among males.
● It was observed that females are paid more.
● The males seem to be more comfortable in the workplace.
● There is no gender bias in Job Involvement.
Overall there was no gender bias and hence gender has no effect on attrition.
Distance From Home

● The graph is right-skewed, which means that most employees live at
a close distance.
● The maximum attrition is seen in the range of 0 to 10 km. Hence distance from home is not
an important factor for attrition.
Marital Status
● Divorced employees are doing more overtime.
● It is seen that single employees are leaving the company more. This suggests that single
employees have fewer responsibilities and hence are more comfortable with change.
Years Since Last Promotion

● The range of years is 0 to 15, i.e. there are employees who have not been promoted in the last 15
years. The average number of years is 2.
● Most employees are in the range of 0 to 2 years since last promotion.
● Attrition increases with the number of years since the last promotion, hence this factor affects the
attrition rate.
Total Working Years
● The range of experience employees have is 0 to 4.
● Most employees have 10 years of experience.
● The employees with less experience are leaving the organization and employees with more
experience are stable.

Results
The major factors affecting the attrition are:
● Age
● Overtime
● Monthly Income
● Total Working Years
● Years Since Last Promotion
The factors which are not affecting are:
● Gender
● Distance From Home
● Job Involvement
● Marital Status
● Business Travel

Conclusion

I have learnt a lot through this internship:

● An overview of data analytics
● Types of analytics
● The role of statistics in data analytics
● Python
● Various libraries and frameworks such as pandas and seaborn
● Creating hypotheses
● Data visualization
● Drawing conclusions from analysis
● Various types of graphs
● Types of EDA
● Implementation of the types of EDA

I understood that a data analyst analyses data sets to find ways to solve problems
relating to a business's customers. A data analyst also communicates this
information to management and other stakeholders. Data analysts are
employed in many different industries such as business,
finance, criminal justice, science, medicine, and government.
The results obtained can be used by the HR department to reduce the
attrition rate by addressing the major factors affecting attrition.

References

● YouTube
● Google
● Kaggle
● Pandas documentation, Matplotlib documentation, NumPy documentation
● W3Schools
● GeeksforGeeks
