Data Mining: Slide-1: Title Slide
Intro:
We have all paid 11 lakhs, and all this for placements...?
Problem Statement:
To predict whether a candidate was placed in a role after their MBA studies and, if so, which factors
helped the most (i.e., work experience, degree, school results, gender, etc.)
Link: https://www.kaggle.com/benroshan/factors-affecting-campus-placement
We selected the Campus Recruitment dataset from Kaggle, which was made available by Ben Roshan.
Shape: 215 * 15
This is a snapshot of the dataset. It contains 15 variables and a total of 215 observations with the
following details:
specialisation    factor    Post Graduation (MBA) - Specialization
We have 7 columns with real values (Numeric) and 8 with object data type (Categorical)
So, before moving ahead we checked for Null and NA values in the dataset.
Slide- 3: Dataset Overview
df.info()
Code
Number of NULLs
Code
Number of NAs: 67
Code
sl_no  gender  ssc_p  ssc_b  hsc_p  hsc_b  hsc_s  degree_p  degree_t  workex  etest_p  specialisation  mba_p  status  salary
0      0       0      0      0      0      0      0         0         0       0        0               0      0       67
So, we can see that there are no NULL values, but there are 67 NA values and all of them are in the salary
field. Now, we need to check why we have these 67 NAs in the salary column. Is this genuinely missing data,
or is there another reason behind it?
Code
Status        n
Not Placed    67
Placed        148
It looks like the 67 NAs in the salary column are due to the fact that 67 students did not get a placement.
This makes sense and, therefore, no further investigation is needed.
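As a rough sketch of the check described above, the snippet below counts missing values per column and then breaks the missing salaries down by placement status. It uses only the Python standard library and a tiny made-up sample (the rows and the helper names `missing_per_column` and `missing_by_group` are assumptions for illustration, not our actual code). With pandas, the per-column count would typically be `df.isna().sum()`.

```python
from collections import Counter

# Hypothetical mini-sample with the same column names as the dataset;
# the real data has 215 rows and 15 columns.
rows = [
    {"sl_no": 1, "status": "Placed", "salary": 270000.0},
    {"sl_no": 2, "status": "Not Placed", "salary": None},
    {"sl_no": 3, "status": "Placed", "salary": 250000.0},
    {"sl_no": 4, "status": "Not Placed", "salary": None},
    {"sl_no": 5, "status": "Placed", "salary": 425000.0},
]


def missing_per_column(records):
    """Count missing (None) values in every column."""
    counts = {column: 0 for column in records[0]}
    for record in records:
        for column, value in record.items():
            if value is None:
                counts[column] += 1
    return counts


def missing_by_group(records, column, group):
    """Count missing values of `column` within each level of `group`."""
    return Counter(r[group] for r in records if r[column] is None)


na_counts = missing_per_column(rows)
na_by_status = missing_by_group(rows, "salary", "status")
status_counts = Counter(r["status"] for r in rows)

print(na_counts)
print(dict(na_by_status))
print(dict(status_counts))
```

If every missing salary belongs to a "Not Placed" student, and the number of missing salaries equals the number of "Not Placed" students, the NAs are structural rather than a data quality problem.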
Explanation:
So, from the dataset overview we can say that, among the categorical variables, hsc_s and degree_t have
3 classes each and all the others have 2 classes each. We can also see that this data is slightly imbalanced,
as we have 148 placed students and 67 not placed students. This means that around 31% of the candidates
(67 of 215) were not placed, which is sad, but let's see what the reasons were :) by performing the EDA,
which will be explained by Noor and Parikshit.
1. University Scores:
2. MBA:
Interpretation:
From the above two graphs, we can see that female students scored significantly higher
percentages than male students at university and MBA level, and that there is no significant
difference in performance at secondary and higher secondary levels or in the employability test
(We haven’t added those graphs)
Interpretation:
1. From the below two graphs, we can see that the dataset contains a sample of 139
male students and 76 female students, which means the number of male students is
almost double the number of female students
2. The male box plot shows more high-salary outliers, which suggests that more male
students are getting high CTC jobs
3. On average, male students are offered slightly greater salaries than female students
Slide- 6: Which degree and MBA specialization has the highest Salary?
The next variable is specialisation. Now, let’s check the impact of degree type and specialisation on the
chances of receiving an offer and a better salary.
Interpretation:
1. It looks like Commerce and Science degree students are preferred by companies
2. Students who opted for Others have a very low chance of placement
3. Specialisation is a clear indicator in placements. Significantly more Marketing and Finance
students received an offer compared with those who specialised in Marketing and HR. This
might be because companies have a lower demand for HR roles
4. The last graph shows that Mkt & Fin students get highly paid jobs and also that Commerce &
Mgmt students occasionally get dream placements with a high salary
So, now let’s check whether academic scores influence the chances of placement or not.
In this correlation plot, the darker the colour, the higher the correlation.
Interpretation:
1. Here, we can observe moderate correlations between the academic
scores, which suggests that the students who performed well in secondary school also
performed well in their further education (i.e., higher secondary, university and
MBA)
2. Also, we can notice that employability test scores have a low correlation with
academic scores, which suggests that these tests were more practical than
theoretical
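To make the idea behind the correlation plot concrete, here is a minimal sketch that computes pairwise Pearson correlations between score columns. The numbers are invented for illustration (they are not rows from the real dataset) and the `pearson` helper is our own assumption; the plot on the slide may have been produced differently.

```python
import math

# Hypothetical scores for six students, invented for illustration.
scores = {
    "ssc_p": [67.0, 79.3, 65.0, 56.0, 85.8, 55.0],
    "hsc_p": [91.0, 78.3, 68.0, 52.0, 73.6, 49.8],
    "degree_p": [58.0, 77.5, 64.0, 52.0, 73.3, 67.3],
    "etest_p": [55.0, 86.5, 75.0, 66.0, 96.8, 55.0],
}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Full matrix: one row per variable, one entry per pair.
corr = {a: {b: pearson(scores[a], scores[b]) for b in scores} for a in scores}

for name, row in corr.items():
    print(name.ljust(9), "  ".join(f"{value:5.2f}" for value in row.values()))
```

Values close to 1 correspond to the darker cells in the plot, and values near 0 to the lighter ones.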
Next let’s check how the scores for each level of education are distributed
2. What does the distribution of the scores look like for each level of education?
a. Average academic scores Vs Placement Status (How many students were placed?):
b. Secondary:
c. Higher Secondary:
d. University:
e. MBA:
f. Employability:
Interpretation:
We can see that:
1. From the 1st graph, most of the candidates' educational
performance falls between 60 and 80%
2. The distribution becomes more concentrated around the median range (62 - 66%) as
the students progressed in their education, from secondary (wide distribution)
to MBA (which is a narrow distribution)
3. The employability test has a different trend, with a very wide and almost
uniform distribution across the buckets
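One simple way to compare how wide or narrow these distributions are is to bucket the scores into 10-point bands and compare the standard deviations. The sketch below does this with made-up numbers; the two lists and the `bucket` and `histogram` helpers are assumptions for illustration only.

```python
import statistics
from collections import Counter

# Hypothetical percentages, invented for illustration (not from the real dataset).
ssc_p = [40.9, 52.0, 63.0, 67.0, 74.0, 81.0, 85.8, 89.4]  # secondary: wide spread
mba_p = [55.5, 57.8, 58.8, 59.4, 60.2, 62.1, 64.7, 66.3]  # MBA: narrow spread


def bucket(score, width=10):
    """Return the label of the `width`-point band that contains `score`."""
    lower = int(score // width) * width
    return f"{lower}-{lower + width}"


def histogram(scores, width=10):
    """Count how many scores fall into each band."""
    return Counter(bucket(score, width) for score in scores)


ssc_hist = histogram(ssc_p)
mba_hist = histogram(mba_p)

# Sample standard deviation as a single-number summary of the spread.
ssc_spread = statistics.stdev(ssc_p)
mba_spread = statistics.stdev(mba_p)

print("Secondary:", dict(ssc_hist), f"stdev={ssc_spread:.1f}")
print("MBA:      ", dict(mba_hist), f"stdev={mba_spread:.1f}")
```

A narrower distribution shows up as fewer occupied bands and a smaller standard deviation.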
Interpretation:
Significantly more students with work experience received offers than those without any work
experience. Work experience is a clear indicator, and it is also associated with higher CTC jobs.
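The comparison behind this interpretation can be sketched as a small cross-tabulation of work experience against placement status. The records below are invented for illustration and the `placement_rate` helper is a hypothetical name, not part of our original code.

```python
from collections import Counter

# Hypothetical (workex, status) pairs, invented for illustration.
records = [
    ("Yes", "Placed"), ("Yes", "Placed"), ("Yes", "Placed"), ("Yes", "Placed"),
    ("Yes", "Not Placed"),
    ("No", "Placed"), ("No", "Placed"), ("No", "Placed"),
    ("No", "Not Placed"), ("No", "Not Placed"), ("No", "Not Placed"),
]

# Cross-tabulation: how often each (workex, status) combination occurs.
counts = Counter(records)


def placement_rate(workex):
    """Share of students with the given workex value who were placed."""
    placed = counts[(workex, "Placed")]
    total = placed + counts[(workex, "Not Placed")]
    return placed / total


rate_with = placement_rate("Yes")
rate_without = placement_rate("No")

print(f"Placed with work experience:    {rate_with:.0%}")
print(f"Placed without work experience: {rate_without:.0%}")
```

Comparing rates rather than raw counts matters here, because the two groups are not the same size.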
Slide- 9: Salary
And the last variable is salary
Interpretation:
1. Looking at the distribution, we can say that most of the students get a package between
200,000 and 400,000, and most salaries above 400,000 are outliers.
2. Male candidates are making more money than female candidates
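For reference, box plots usually flag outliers with the 1.5 * IQR rule. The sketch below applies that rule to a handful of made-up salaries; we are assuming this convention here, and the salary values are hypothetical rather than taken from the dataset.

```python
import statistics

# Hypothetical salaries, invented for illustration.
salaries = [200000, 216000, 240000, 250000, 265000, 300000, 360000, 400000, 940000]

# Quartiles: statistics.quantiles with n=4 returns the three cut points Q1, Q2, Q3.
q1, _median, q3 = statistics.quantiles(salaries, n=4)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the quartiles are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [s for s in salaries if s < lower_fence or s > upper_fence]

print(f"Q1={q1:.0f}, Q3={q3:.0f}, IQR={iqr:.0f}")
print(f"Fences: [{lower_fence:.0f}, {upper_fence:.0f}]")
print("Outliers:", outliers)
```

The exact fences depend on how the quartiles are interpolated, so different tools can disagree slightly on borderline points.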
2. Create a Correlation Matrix: Correlation is a statistical technique that can show whether and
how strongly pairs of variables are related.
3. Feature selection: From the correlation matrix we can now select the features for our model
that are highly correlated with the placement status variable. ssc_p, hsc_p, degree_p, workex,
and specialisation are the significant features that will help our model identify patterns.
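The selection step can be sketched as follows: encode the categorical columns as 0/1, compute each feature's correlation with the placement status, and keep the features above a cut-off. The data, the `encode` and `pearson` helpers, and the 0.3 threshold are all assumptions for illustration; the slides do not state the cut-off we actually used.

```python
import math

# Hypothetical students, invented for illustration.
students = [
    {"ssc_p": 67.0, "workex": "No", "etest_p": 55.0, "status": "Placed"},
    {"ssc_p": 79.3, "workex": "Yes", "etest_p": 86.5, "status": "Placed"},
    {"ssc_p": 65.0, "workex": "No", "etest_p": 75.0, "status": "Placed"},
    {"ssc_p": 56.0, "workex": "No", "etest_p": 66.0, "status": "Not Placed"},
    {"ssc_p": 85.8, "workex": "Yes", "etest_p": 96.8, "status": "Placed"},
    {"ssc_p": 55.0, "workex": "Yes", "etest_p": 55.0, "status": "Not Placed"},
    {"ssc_p": 46.0, "workex": "No", "etest_p": 74.3, "status": "Not Placed"},
    {"ssc_p": 82.0, "workex": "Yes", "etest_p": 67.0, "status": "Placed"},
]

# Binary categories are mapped to 1.0 / 0.0 so they can enter the correlation.
BINARY = {"Yes": 1.0, "No": 0.0, "Placed": 1.0, "Not Placed": 0.0}


def encode(value):
    """Map binary categories to 0/1 and leave numbers as floats."""
    return BINARY[value] if isinstance(value, str) else float(value)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


target = [encode(s["status"]) for s in students]
features = ["ssc_p", "workex", "etest_p"]

# Correlation of every feature with the encoded placement status.
corr_with_status = {
    f: pearson([encode(s[f]) for s in students], target) for f in features
}

# Rank by absolute correlation and keep those above a hypothetical threshold.
ranking = sorted(features, key=lambda f: abs(corr_with_status[f]), reverse=True)
THRESHOLD = 0.3  # assumed cut-off, for illustration only
selected = [f for f in ranking if abs(corr_with_status[f]) >= THRESHOLD]

for f in ranking:
    print(f"{f:8s} {corr_with_status[f]:+.2f}")
print("Selected:", selected)
```

Correlation with a 0/1 target only captures linear association, so this kind of filter is a starting point for feature selection rather than a final answer.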
Now that we have our variables decided, let’s move on to perform logistic regression as explained by
Ranjani
Slide 13 (Conclusion):
Here are a few things to keep in mind:
1. Overall, the grades became more concentrated as the students progressed in their education; it
could be that it is harder for students to differentiate themselves on grades alone, so they focus
on other achievements (work experience, voluntary roles)
2. Successfully placed students performed significantly better than their counterparts during
secondary, higher secondary and university education, but not at the MBA level