Developing Classifiers Through Machine Learning Algorithms for Student Placement Prediction Based on Academic Performance
To cite this article: Laxmi Shanker Maurya, Md Shadab Hussain & Sarita Singh (2021)
Developing Classifiers through Machine Learning Algorithms for Student Placement
Prediction Based on Academic Performance, Applied Artificial Intelligence, 35:6, 403-420, DOI:
10.1080/08839514.2021.1901032
Introduction
Placement is a decisive factor in the successful completion of any coursework at
the graduate or postgraduate level. Every student dreams of being placed in a top
MNC to achieve their goals and objectives. Aiming to place the maximum number of
students, universities and institutions are raising their game by equipping and
upgrading their students through training and placement cells (Accessed May 04, 2020).
Machine learning is the science of getting computers to learn without being
explicitly programmed. Every time a spam filter saves you from having to wade
through tons of spam in your e-mail, that is because your
computer has learned to distinguish spam from non-spam e-mail. That is
machine learning (Accessed June 27, 2020).
According to Samuel, machine learning is "the field of study that gives computers
the ability to learn without being explicitly programmed". This is an older definition
of machine learning. Another definition is given by Tom Mitchell: "A computer
program is said to learn from experience E with respect to some class of tasks
T and performance measure P, if its performance at tasks in T, as measured by
P, improves with experience E" (Accessed June 27, 2020).
Classification Algorithms
Classification is a data analysis task, i.e., the process of finding a model that
describes and distinguishes data classes and concepts. It is the problem of
identifying to which of a set of categories a new observation belongs, on the
basis of a training set of data containing observations whose category
membership is known.
Example: Before starting any project, we need to check its feasibility. In this
case, a classifier is required to predict class labels such as 'Safe' and 'Risky' for
adopting the project and further approving it. Classification is a two-step process:
a learning step, in which a model is constructed from training data with known
labels, and a classification step, in which the model is used to predict the class
label of new observations.
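As a hedged illustration of this two-step process (not code from the paper), the sketch below trains a classifier on hypothetical project records labeled 'Safe' or 'Risky' and then predicts the label of a new project; the feature names and values are invented for illustration.

```python
# Illustrative sketch only: hypothetical project-feasibility data, not from the paper.
from sklearn.tree import DecisionTreeClassifier

# Step 1 (learning): build a model from training observations whose labels are known.
# Invented features: [estimated cost, team size, duration in months]
X_train = [[10, 5, 6], [80, 3, 18], [15, 8, 4], [60, 2, 24], [20, 6, 8], [90, 4, 20]]
y_train = ["Safe", "Risky", "Safe", "Risky", "Safe", "Risky"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Step 2 (classification): predict the class label of a new, unseen project.
new_project = [[25, 7, 9]]
print(model.predict(new_project))  # e.g. ['Safe']
```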
Related Work
In this section, we review the literature on classification methods for binary
and multi-label classification, and provide an overview of the work done by
different researchers.
Korst et al. (2019) give an introduction to classification algorithms and the metrics
used to quantify and visualize their performance. They first briefly explain what is
meant by a classification algorithm and, as an example, describe the naive Bayesian
classification algorithm in more detail. Using the concept of a confusion matrix,
they then define the various performance metrics that can be derived from it.
Rao et al. (2018) described a data-mining-based model for student placement
prediction using machine learning algorithms, treating the extraction of meaningful
information from datasets as a data mining process. The authors also used an
educational data mining tool, regarded as more powerful in the educational domain.
The work presents an effective method for extracting students' performance from
various parameters and predicts and analyses whether the students were recruited
during campus placement. Predictions were performed with the machine learning
algorithms J48, Naive Bayes, Random Forest, and Random Tree in the Weka tool,
together with multiple linear regression. Based on the results, higher education
organizations can offer superior training to their students.
The performance of different classification algorithms, namely Naive Bayes,
Multilayer Perceptron, Instance-Based K-Nearest Neighbor (IBK), J48 Decision Tree,
Simple CART, ZeroR, CV Parameter Selection, and Filtered Classifier, has also been
analyzed. Diabetes, nutrition, E. coli protein, and mushroom datasets were used to
calculate the performance using cross-validation of parameters, and the performance
of the classification algorithms was then identified and compared (Swarupa and Jyothi 2016).
Other work addresses the placement chance prediction problem (Elayidom, Idikkula,
and Alexander 2011) and placement and skill ranking predictors for programming
classes using class attitude, psychological scales, and code metrics of the students
(Ishizue et al. 2018). A qualitative study investigates the career placement concerns
of international graduate students returning to their home countries, heading to
other countries, or remaining in the United States after their education (Shen and
Herr 2004). Another study analyzes student performance in engineering placement
using data mining (Agarwal et al. 2019). Students' campus placement probability has
been predicted using binary logistic regression (Kumar et al. 2019), and psychology-
assisted prediction of academic performance has been carried out using machine
learning (Halde, Deshpande, and Mahajan 2016).
Brodley and Smyth (1997) presented a perspective on the overall process of developing
classifiers for real-world classification problems. Another paper analyzes how to
introduce machine learning algorithms into the process of direct volume rendering;
a conceptual framework for the optical property function elicitation process is
proposed and particularized for the use of attribute-value classifiers (Cerquides
et al. 2005). A study has been performed on the performance analysis of classification
algorithms for activity recognition using micro-Doppler features (Lin and Le Kernec
2017), and news articles have been classified using Random Forests and weighted
multimodal features (Liparas et al. 2014).
In another paper, the authors analyzed the computation times of different
classification algorithms on many datasets using parallel profiling and computing
techniques; the performance analysis was based on factors such as the unique nature
of the dataset, the size and type of the class, the diversity of the data in the
dataset, and so on (Upadhyay and Singh 2018). The text classification process on
different datasets has been illustrated using standard supervised machine learning
techniques (Mishu and Rafiuddin 2016). A further study aims to identify the key
trends among different types of supervised machine learning algorithms and their
performance and usage for disease risk prediction (Uddin et al. 2019). A study has
also been performed on multi-label classification with weighted classifier selection
and stacked ensemble (Xia, Chen, and Yang 2020).
Dataset
The dataset used in the study was collected from final-year students of the B.Tech.
CSE & IT branch of Shri Ram Murti Smarak College of Engineering and Technology
(SRMSCET), Bareilly, Uttar Pradesh (India). These students underwent various
placement drives in the academic session 2019–20. The first three input features
in the dataset are the percentage marks achieved by the students in class Tenth,
class Twelve, and B.Tech., respectively. The fourth input feature is the number of
backlogs pending in B.Tech. up to the date of data collection. The output/target
class is whether the student was placed in any of the placement drives or not:
a 1 in the output column indicates that the student was placed and a 0 indicates
that the student was unplaced. All four input features and the target class are
categorical in nature. The entries in the first three input features, i.e., the
percentage marks acquired by the students in class Tenth, class Twelve, and B.Tech.,
are encoded as follows: 1 is less than 60%, 2 is greater than or equal to 60% but
less than 70%, 3 is greater than or equal to 70% but less than 80%, 4 is greater
than or equal to 80% but less than 90%, and 5 is greater than or equal to 90%.
The total number of respondents in the dataset is 170. A Google Form with appropriate
instructions was designed and sent to the students for data collection.
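The encoding of percentage marks into the five categorical levels described above can be reproduced with pandas. The sketch below is only an illustration of that binning scheme; the column names and raw marks are assumed, not taken from the paper's dataset.

```python
import pandas as pd

# Hypothetical raw percentage marks; column names are assumed for illustration.
df = pd.DataFrame({
    "Tenth":  [55.0, 62.5, 74.0, 83.0, 91.0],
    "Twelve": [58.0, 69.0, 77.5, 88.0, 95.0],
    "BTech":  [59.0, 65.0, 72.0, 81.0, 90.5],
})

# 1: <60, 2: [60, 70), 3: [70, 80), 4: [80, 90), 5: >=90, as defined in the Dataset section.
bins = [0, 60, 70, 80, 90, float("inf")]
labels = [1, 2, 3, 4, 5]
for col in ["Tenth", "Twelve", "BTech"]:
    df[col] = pd.cut(df[col], bins=bins, labels=labels, right=False)

print(df)
```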
Tools Used
All eight classification algorithms used to build the classifiers were implemented
using the following Python libraries:
Seaborn – for heat map generation.
Scikit-learn/sklearn – for algorithm implementation.
Pandas – for dataset-related operations.
Matplotlib – for plotting.
Google Colaboratory – a free cloud service from Google – was used to write and
execute the Python code.
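For reference, a minimal set of imports matching the libraries listed above could look like this (a sketch; the exact modules used in the paper are not stated beyond the library names):

```python
import pandas as pd                    # dataset-related operations
import seaborn as sns                  # heat map generation
import matplotlib.pyplot as plt        # plotting
from sklearn.model_selection import train_test_split          # algorithm implementation
from sklearn.metrics import accuracy_score, confusion_matrix  # and evaluation
```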
Experimental Input
Table 1. Selected algorithm, input features (feature matrix x), target class (response vector y), and training and test data percentages used for developing the classifiers.

Sr. No. | Selected Algorithm | Input Features / Feature Matrix (x) | Target Class / Response Vector (y) | Training Data | Test Data
1 | Gaussian Naive Bayes | Tenth, Twelve, BTech, Backlog | Placed | 80% | 20%
2 | K-Nearest Neighbor | Tenth, Twelve, BTech, Backlog | Placed | 80% | 20%
3 | Support Vector Machine | Tenth, Twelve, BTech, Backlog | Placed | 80% | 20%
4 | Stochastic Gradient Descent | Tenth, Twelve, BTech, Backlog | Placed | 80% | 20%
5 | Random Forest | Tenth, Twelve, BTech, Backlog | Placed | 80% | 20%
6 | Decision Tree | Tenth, Twelve, BTech, Backlog | Placed | 80% | 20%
7 | Logistic Regression | Tenth, Twelve, BTech, Backlog | Placed | 80% | 20%
8 | Neural Network | Tenth, Twelve, BTech, Backlog | Placed | 80% | 20%
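Under the split shown in Table 1, the feature matrix x and the response vector y are separated from the dataset and divided 80/20 into training and test data. A plausible sketch is given below; the file name and column names are assumptions based on the Dataset section, not the paper's own code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed file name and column names, for illustration only.
data = pd.read_csv("placement_data.csv")
x = data[["Tenth", "Twelve", "BTech", "Backlog"]]   # feature matrix
y = data["Placed"]                                   # response vector (1 = placed, 0 = unplaced)

# 80% training data, 20% test data, as in Table 1; the random_state value is
# tuned per classifier (see Table 2), 8 being the most common optimum reported.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=8)
```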
Table 2. Developed classifier, optimum value of random state, accuracy score, percentage accuracy score generated by the developed classifier, and remark (if any).

Sr. No. | Developed Classifier | Random State | Accuracy Score | Percentage Accuracy Score | Remark
1 | Gaussian Naive Bayes | 44 | 0.8823 | 88.23 | default parameters
2 | K-Nearest Neighbor | 08 | 0.8823 | 88.23 | n_neighbors = 13
3 | Support Vector Machine | 08 | 0.8529 | 85.29 | kernel = 'linear'
4 | Stochastic Gradient Descent | 08 | 0.9117 | 91.17 | default parameters
5 | Random Forest | 08 | 0.8529 | 85.29 | n_estimators = 100
6 | Decision Tree | 03 | 0.8235 | 82.35 | criterion = 'entropy'
7 | Logistic Regression | 08 | 0.8529 | 85.29 | default parameters
8 | Neural Network | 08 | 0.8529 | 85.29 | default parameters
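The remarks in Table 2 suggest how each classifier was parameterized. The sketch below builds the eight classifiers with scikit-learn under those settings, assuming the train/test split from the previous sketch; where Table 2 says "default parameters", scikit-learn defaults are assumed, and the "Neural Network" is assumed here to be sklearn's MLPClassifier rather than the authors' exact implementation.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Parameters follow the remarks in Table 2; everything else is left at scikit-learn defaults.
classifiers = {
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbor": KNeighborsClassifier(n_neighbors=13),
    "Support Vector Machine": SVC(kernel="linear"),
    "Stochastic Gradient Descent": SGDClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy"),
    "Logistic Regression": LogisticRegression(),
    "Neural Network": MLPClassifier(),   # assumption: MLPClassifier stands in for "Neural Network"
}

for name, clf in classifiers.items():
    clf.fit(x_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(x_test)))
```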
Table 3. (Continued). Developed classifier, confusion matrix, and heatmap (heatmap images not reproduced here).

Sr. No. | Developed Classifier | Confusion Matrix
5 | Random Forest | [[6 2] [3 23]]
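The confusion matrices and heatmaps of Table 3 can be produced with scikit-learn and Seaborn (the library the authors list for heat map generation). The sketch below uses the Random Forest matrix [[6, 2], [3, 23]] quoted above purely as example data; the axis labels and their ordering are illustrative assumptions.

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Either compute the matrix from test-set predictions ...
# cm = confusion_matrix(y_test, clf.predict(x_test))
# ... or, for illustration, reuse the Random Forest values quoted from Table 3.
cm = [[6, 2], [3, 23]]

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Unplaced (0)", "Placed (1)"],   # assumed label order
            yticklabels=["Unplaced (0)", "Placed (1)"])
plt.xlabel("Predicted class")
plt.ylabel("Actual class")
plt.title("Random Forest confusion matrix")
plt.show()
```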
Table 6. AUC (area under curve) – ROC (receiver operating characteristic) curve (ROC curve plots not reproduced here).

Sr. No. | Developed Classifier | AUC
1 | Gaussian Naive Bayes | 0.76
5 | Random Forest | 0.76
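The AUC values and ROC curves of Table 6 can be obtained with scikit-learn. A minimal sketch follows, assuming a fitted classifier `clf` that exposes predict_proba (for example the Random Forest from the earlier sketch) and the same test split as before.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Probability of the positive class (Placed = 1); assumes clf supports predict_proba.
y_score = clf.predict_proba(x_test)[:, 1]

auc = roc_auc_score(y_test, y_score)
fpr, tpr, _ = roc_curve(y_test, y_score)

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```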
Table 5 presents the MSE and Log Loss values. MSE is an accuracy parameter and Log
Loss is a performance parameter. MSE and Log Loss are more significant in the case
of regression problems, and the value of Log Loss should lie between 0 and 1 to be
meaningful. Our problem is classification and the computed Log Loss exceeds 1, so
it is not significant in our case. Nevertheless, the calculated values of MSE and
Log Loss are lowest for Stochastic Gradient Descent, which also has the highest
accuracy score of 0.9117 as mentioned in Section 4.2. Although this table is not
very significant, it validates our results in Section 4.2.
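For reference, MSE and Log Loss as reported in Table 5 can be computed as follows; this is a sketch under the assumption that predicted labels and predicted class probabilities for the test set are available from a fitted classifier `clf`.

```python
from sklearn.metrics import mean_squared_error, log_loss

y_pred = clf.predict(x_test)         # predicted class labels (0/1)
y_prob = clf.predict_proba(x_test)   # predicted class probabilities

mse = mean_squared_error(y_test, y_pred)
ll = log_loss(y_test, y_prob)
print(f"MSE = {mse:.4f}, Log Loss = {ll:.4f}")
```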
Acknowledgments
This research was accomplished by a team of three faculty members from the Department of
Computer Science and Engineering at Shri Ram Murti Smarak College of Engineering and
Technology (SRMSCET), Bareilly, Uttar Pradesh (India). The authors are thankful to the
management of SRMSCET for its encouragement and for providing research infrastructure
and facilities, such as reputed print and online journals in the college library and
computing and Internet facilities in the department. The authors are also thankful to all
final-year students of the CSE & IT branch of SRMSCET for being prompt and supportive in
the data collection process. We are thankful to our family members and colleagues for
their support and cooperation. Finally, our special thanks to faculty members Hiresh Gupta
and Durgesh Tripathi for their valuable direction.
ORCID
Laxmi Shanker Maurya http://orcid.org/0000-0002-0631-7274
References
Agarwal, K., M. Ekansh, R. Chandrima, P. Manjusha, and S. Siddharth. “Analyzing student
performance in engineering placement using data mining.” In Proceedings of International
Conference on Computational Intelligence and Data Engineering, pp.171–81. Springer,
Singapore, 2019.
Brodley, C., and P. Smyth. 1997. Applying classification algorithms in practice. Statistics and
Computing 7.
Cerquides, J., M. López-Sánchez, S. Ontañón, E. Puertas, A. Puig, O. Pujol, and D. Tost.
“Classification algorithms for biomedical volume datasets.” In Conference of the Spanish
Association for Artificial Intelligence, pp.143–52. Springer, Berlin, Heidelberg, 2005.
Elayidom, S., S. M. Idikkula, and J. Alexander. 2011. A generalized data mining framework for
placement chance prediction problems. International Journal of Computer Applications 31
(no. 3):0975–8887.
Halde, R. R. “Application of machine learning algorithms for betterment in education system.”
In 2016 International Conference on Automatic Control and Dynamic Optimization
Techniques (ICACDOT), pp.1110–14. Bangalore, India: IEEE, 2016.
Halde, R. R., A. Deshpande, and A. Mahajan. “Psychology assisted prediction of academic
performance using machine learning.” In 2016 IEEE International Conference on Recent
Trends in Electronics, Information & Communication Technology (RTEICT), pp.431–35.
Bangalore, India: IEEE, 2016.
Ishizue, R., K. Sakamoto, H. Washizaki, and Y. Fukazawa. 2018. Student placement and skill
ranking predictors for programming classes using class attitude, psychological scales, and
code metrics. Research and Practice in Technology Enhanced Learning 13 (no. 1):7.
doi:10.1186/s41039-018-0075-y.
Kabra, R. R., and R. S. Bichkar. 2011. Performance prediction of engineering students using
decision trees. International Journal of Computer Applications 36 (no. 11):8–12.
Korst, J., V. Pronk, M. Barbieri, and S. Consoli. 2019. Introduction to classification algorithms
and their performance analysis using medical examples. In Data science for healthcare, ed.
S. Consoli, D. Reforgiato Recupero, and M. Petković, 39–73. Cham: Springer.
Kumar, D., Z. Satish, R. D. S., and A. S. 2019. Predicting student’s campus placement prob
ability using binary logistic regression. International Journal of Innovative Technology and
Exploring Engineering 8 (no. 9):2633–35.
Lin, Y., and J. Le Kernec. “Performance analysis of classification algorithms for activity
recognition using micro-doppler feature.” In 2017 13th International Conference on
Computational Intelligence and Security (CIS), pp.480–83. Hongkong, China: IEEE, 2017.
Liparas, D., Y. HaCohen-Kerner, A. Moumtzidou, S. Vrochidis, and I. Kompatsiaris. “News
articles classification using random forests and weighted multimodal features.” In
Information Retrieval Facility Conference, pp.63–75. Springer, Cham, 2014.
Mishu, S., and S. M. Rafiuddin. “Performance analysis of supervised machine learning algo
rithms for text classification.” In 2016 19th International Conference on Computer and
Information Technology (ICCIT), pp.409–13. Dhaka, Bangladesh: IEEE, 2016.
Shen, Y.-J., and E. L. Herr. 2004. Career placement concerns of international graduate students:
A qualitative study. Journal of Career Development 31 (no. 1):15–29. doi:10.1177/
089484530403100102.
Sreenivasa Rao, K., N. Swapna, and P. Praveen Kumar. 2018. Educational data mining for
student placement prediction using machine learning algorithms. International Journal of
Engineering and Technology (UAE) 7 (no. 1.2):43–46. doi:10.14419/ijet.v7i1.2.8988.
Swarupa, R. A., and S. Jyothi. “Performance analysis of classification algorithms under different
datasets.” In 2016 3rd International Conference on Computing for Sustainable Global
Development (INDIACom), pp.1584–1589. New Delhi, India: IEEE, 2016.
Thangavel, S. K., P. DivyaBkaratki, and S. Abijitk. “Student placement analyzer:
A recommendation system using machine learning.” In 2017 4th International Conference
on Advanced Computing and Communication Systems (ICACCS), pp.1–5. Coimbatore,
India: IEEE, 2017.
Uddin, S., A. Khan, M. Hossain, and M. Ali Moni. 2019. Comparing different supervised
machine learning algorithms for disease prediction. BMC Medical Informatics and Decision
Making 19 (no. 1):1–16. doi:10.1186/s12911-019-1004-8.
Upadhyay, N. M., and R. S. Singh. “Performance evaluation of classification algorithm in weka
using parallel performance profiling and computing technique.” In 2018 Fifth International
Conference on Parallel, Distributed and Grid Computing (PDGC), pp.522–27. Solan, India:
IEEE, 2018.
Verma, C., Zoltán Illés, and S. Veronika “Age group predictive models for the real time
prediction of the university students using machine learning: Preliminary results.” In 2019
IEEE International Conference on Electrical, Computer and Communication Technologies
(ICECCT), pp.1–7. Coimbatore, India: IEEE, 2019.
Xia, Y., K. Chen, and Y. Yang. 2020. Multi-label classification with weighted classifier selection
and stacked ensemble. Information Sciences. doi:10.1016/j.ins.2020.06.017.
Zhang, Y. “Support vector machine classification algorithm and its application.” In
International Conference on Information Computing and Applications, pp.179–86.
Springer, Berlin, Heidelberg, 2012.
Web References
Accessed May 04, 2020. https://www.poornima.edu.in/role-of-university-in-students-placement/
Accessed June 25, 2020. https://www.geeksforgeeks.org/basic-concept-classification-data-mining/
Accessed June 27, 2020. https://www.coursera.org/lecture/machine-learning/welcome-RKFpn/