Collaborative_job_prediction_based_on_Nave_Bayes_Classifier_using_python_platform

The paper presents a recommendation system for job portals using a collaborative filtering technique based on the Naive Bayes classifier implemented in Python. It analyzes user profiles and job data to suggest suitable job opportunities by calculating a similarity index using Euclidean distance and ranking them accordingly. The proposed model involves several stages including data acquisition, sanitization, and Bayesian ranking to enhance the accuracy of job recommendations.


2016 International Conference on Computational Systems and Information Systems for Sustainable Solutions

Collaborative Job Prediction based on Naïve Bayes Classifier using Python Platform

Dr. Savita Choudhary, Faculty, CSE, Sir MVIT, Bangalore
Siddanth Koul, Dept. of CSE, Sir MVIT, Bangalore
Shridhar Mishra, Dept. of CSE, Sir MVIT, Bangalore
Anunay Thakur, Dept. of ISE, Sir MVIT, Bangalore
Rishabh Jain, Dept. of CSE, Sir MVIT, Bangalore

Abstract - The paper aims to implement a recommendation system based on the collaborative filtering technique for job portals. The system is designed to suggest jobs to the user depending upon his profile, by calculating a similarity index using the Euclidean distance between two skill sets and then ranking them using the Naïve Bayes algorithm. The recommendation system has been implemented in Python.

Keywords - Recommendation Engine, Job, Profile, History, Python

I. INTRODUCTION

A recommendation engine or a recommendation system helps to predict what a user may like to see next from a list of given items. Recommendation systems produce results based on either the collaborative filtering or the content based filtering technique. It builds the model focused on users' past activity.

Many organizations and job portals receive a large number of applications. At the same time, databases related to job seekers and job profiles are maintained. In a recommendation system, applications are analyzed to extract data such as users' skills, previous job history, demographic information and other necessary details. Depending upon the extracted data, the job seeker is suggested new jobs other than what is being searched for. The current model is built on the user's skill set.

The collaborative filtering technique [1] works on user-item interaction, wherein the recommendation is based upon user similarities. These similarities depend on users' preferences. The technique combines all the data to create a ranked list of suggestions.

Pazzani and Billsus [13] used collaborative filtering to recommend webpages depending on users' profiles and their ratings of a specific webpage. On the same ground, jobs can also be categorized depending upon the users' profiles and the jobs they are related to. The Naïve Bayes [2] approach was proposed by K P Murphy et al. to check for the collaborative filtering of webpages. Naïve Bayes [2] can also be used as the tool for developing a recommendation engine for any purpose. Pre-discovered association rules were proposed by Mobasher et al. [14], which gave pattern-based recommendations.

The Netflix Competition for the recommendation system [8] highlights the combination of both content based filtering and collaborative filtering. The analysis of different algorithms done by M. Papagelis and D. Plexousakis [9] shows how recommendations can be obtained depending upon the data set. On similar ground, depending upon the data of jobs and its type, different methods were combined to obtain a suitable algorithm for job prediction.

II. TYPICAL RECOMMENDATION SYSTEM

Figure 1: Typical Recommendation System

As shown in Figure 1, a typical recommendation system consists of multiple stages.

In the data acquisition stage, data from different users are collected and stored in a database. These data consist of users' profiles, past activities and browsing history.

The transformation stage deals with different processes such as sanitization of data and cluster formation.

The computation model deals with the calculation part of the system. It mainly consists of two parts: result set generation and data filtering.

The last stage is the recommendation unit, where recommendations to the users are made depending upon the filtered result set from the computation module.

Skill recommendations are based on the Naïve Bayes Classifier [2], which takes into account the skills provided by the user in the form of collaboration.

978-1-5090-1022-6/16/$31.00 ©2016 IEEE 302


Authorized licensed use limited to: BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE. Downloaded on February 15,2023 at 13:52:23 UTC from IEEE Xplore. Restrictions apply.
2016 International Conference on Computational Systems and Information Systems for Sustainable Solutions

III. PROPOSED MODEL

Figure 2: Block diagram of the proposed model

The block diagram of the proposed model is shown in Figure 2.

• Data is acquired from job portal websites, which track the browsing patterns of users.
• Based on the data collected, preferred skills are mapped to each user and the cluster of skills is updated.
• A similarity index is generated based on the Euclidean distance between the skill sets.
• Similar skills are then ranked using the Naïve Bayes algorithm.

Advantages of the proposed model:

• Model based on pure statistics and the count of the occurrences.
• Open source design.
• Small computational overhead compared to machine learning models.
• Small data set required.

IV. DETAILED DESIGN

Step 1: The data acquisition phase deals with the collection of raw data. In our current implementation, raw skill data is fetched for various users from a job portal that keeps a record of all the user preferences and their past activities. Currently the dataset comprises 1500 user profiles.

Step 2: The data sanitization phase is used to enhance the data for our computational processing. It involves removing various kinds of noise retrieved from the user profiles. The output of this phase is a sanitized set of skills for each user [3]. Various algorithms and techniques were applied to create a generic set of skills from the given input data.

Step 3: Data initialization is a preprocessing stage comprising two steps:

1. System initialization, including various system-specific variables. This generally deals with setting up the system environment and related functionalities.
2. Cluster formation based on skill occurrences takes place in this phase. The algorithm used to generate the cluster creates nC2 combinations of all the connected skills and updates a global dictionary of skills, which is utilized by other models to generate the final result set.

Step 4: The similarity index generation unit deals with the calculation of similarity weights between two sets of skills data in order to define a relationship between the two.

Currently the following two algorithms are implemented and analyzed based on their performance [4].

1) Euclidean distance

[Input: skill1 and skill2; Output: similarity score]

a) Find the intersection set for both the skills.
b) Normalize the skill occurrences belonging to the intersection set with a normalization factor of 3.
c) Calculate $\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$, where $x_i \in X$ and $y_i \in Y$. Here X and Y denote the sets of counts of intersected skills for the two given skills.

2) Pearson coefficient

[Input: skill1 and skill2; Output: similarity score]

a) Find the intersection set for both the skills.
b) Normalize the skill occurrences belonging to the intersection set with a normalization factor of 3.
c) For the two normalized sets of scores X and Y, the Pearson coefficient is given by

$$r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{\left[ n \sum X^2 - \left( \sum X \right)^2 \right] \left[ n \sum Y^2 - \left( \sum Y \right)^2 \right]}}$$

Where,

n = number of pairs of scores
ΣXY = sum of the products of paired scores
ΣX = sum of the scores in set X
ΣY = sum of the scores in set Y
ΣX² = sum of the squares of the scores in set X
ΣY² = sum of the squares of the scores in set Y

Step 5: The filtration unit deals with the removal of certain skills from the recommendation set based on certain defined constraints. This operation is done in order to keep the recommendations healthy and free of noise. In our current model, skills with low occurrence are filtered and added to a global set so that they do not get recommended along with the popular skills in our final result set.

Figure 3: Euclidean Error Analysis (mean squared error against normalization factors 1 to 6; the plotted values are listed in Table 1)
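The two similarity measures of Step 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each skill is represented by a dictionary mapping co-occurring skills to counts, and that "normalizing with a factor of 3" means dividing each count by that factor (the paper does not spell this out). The Pearson function uses the standard computational form built from the sums the paper defines. All function names and the sample counts are hypothetical.

```python
import math


def intersect_counts(cluster_a, cluster_b, norm_factor=3):
    """Return normalized counts for the skills present in both clusters."""
    shared = sorted(set(cluster_a) & set(cluster_b))
    # Assumption: normalization divides each raw count by the factor.
    xs = [cluster_a[s] / norm_factor for s in shared]
    ys = [cluster_b[s] / norm_factor for s in shared]
    return xs, ys


def euclidean_distance(xs, ys):
    """Square root of the summed squared differences of paired counts."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))


def pearson_coefficient(xs, ys):
    """Pearson coefficient from n, sum(XY), sum(X), sum(Y), sum(X^2), sum(Y^2)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    spread = (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    if spread <= 0:
        # Undefined when either set has no variation; 0.0 is a chosen fallback.
        return 0.0
    return (n * sum_xy - sum_x * sum_y) / math.sqrt(spread)


# Hypothetical co-occurrence counts for two skills.
java = {"sql": 9, "html": 3, "spring": 6}
python = {"sql": 6, "html": 12, "django": 12}

xs, ys = intersect_counts(java, python)
dist = euclidean_distance(xs, ys)
r = pearson_coefficient(xs, ys)
print(dist, r)
```

Only the skills shared by both clusters ("html" and "sql" here) contribute, which mirrors sub-step (a) of both algorithms.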

Step 6: The Bayesian ranking phase uses the concept of Naïve Bayes Classifiers to rank the final set of recommendations to the user. The conditional probability of two skills occurring together is calculated in order to determine their ranking with respect to a particular skill.

This makes sure that the skills whose jobs are recommended remain balanced over the total set of skill clusters captured from the user data. The following equation is used to find the Bayesian score between two skills:

[equation not preserved in this copy of the paper]

Where,

$$P(j \mid i) = \frac{\mathrm{Freq}(j, i)}{\mathrm{Freq}(i)}$$

And

$$P(j) = \frac{\text{Number of people possessing skill } j}{\text{Total count of users}}$$

P(j|i): represents the probability of skill j being in a user's profile given that skill i is already present
Freq(j, i): represents the number of j-i pairs in the cluster of skill i
Freq(i): represents the number of users who possess skill i

As P(j|i) would not be equal to P(i|j), this model uses an asymmetric similarity function [5][6][7]. One of the limitations of using an asymmetric similarity function is that each item i will tend to have high conditional probabilities with items that are being purchased frequently. This solution is inspired by the inverse-document scaling performed in information retrieval systems.

V. RESULTS

In order to get optimum results, an error analysis has to be chosen whose results stabilize after a certain normalization factor.

Figure 4: Pearson Error Analysis (mean squared error against normalization factors 1 to 6; the plotted values are listed in Table 1)

From Fig 3 and Fig 4 it is observed that the deviation in the error rate is minimum using the Euclidean distance algorithm.

TABLE 1: MSE for similarity functions

Normalization factor    Euclidean error    Pearson error
1                       0.3566             0.2994
2                       0.0711             0.3055
3                       0.0152             0.3073
4                       0.0045             0.3121
5                       0.0021             0.3143
6                       0.0009             0.3127

Table 1 presents the corresponding values of MSE at different normalization factors.

In Fig 3 and Fig 4, the X-axis represents the normalization factor and the Y-axis represents the mean squared error rate, which is calculated using the formula

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (W' - W)^2$$

MSE = Mean Squared Error
n = Sample space
(W' - W) = Deviation
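As a small illustration of the error measure, the sketch below computes the mean squared error for paired weights and then looks at how quickly the Euclidean error reported in Table 1 falls from one normalization factor to the next. The paper does not say what W' and W are beyond calling their difference the deviation, so treating them as a predicted and a reference weight is an assumption. The function name and the sample weights are hypothetical; the six error values are taken from Table 1.

```python
def mean_squared_error(predicted, reference):
    """Average of the squared deviations (W' - W) over n samples."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted)
    return sum((wp - w) ** 2 for wp, w in zip(predicted, reference)) / n


# Hypothetical weights, only to show the calculation.
mse = mean_squared_error([0.5, 1.0, 1.5], [1.0, 1.0, 1.0])

# Euclidean MSE per normalization factor, copied from Table 1.
euclidean_mse = {1: 0.3566, 2: 0.0711, 3: 0.0152, 4: 0.0045, 5: 0.0021, 6: 0.0009}

# Decrease in error when moving from factor k to factor k + 1.
factors = sorted(euclidean_mse)
drops = [
    round(euclidean_mse[a] - euclidean_mse[b], 4)
    for a, b in zip(factors, factors[1:])
]
print(mse, drops)
```

The successive decreases shrink sharply once the factor reaches 3, which is one way to read the claim that the curve stabilizes there.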


Compared to the Pearson coefficient, in which the deviations keep increasing as the normalization factor increases, the Euclidean distance algorithm provides better results for our application.

From Table 1 and Fig 3 it is also observed that the graph stabilizes after normalization factor 3; hence it is the optimum value that should be set to obtain consistent results.

Bayesian weights, as shown in Figures 5 and 6, specify a weighted relation between a given skill and a set of skills, based on their occurrences. These weights are used to rank the skills in decreasing order of weight. The figures depict the Bayesian weights of skills belonging to the Java and Python clusters.

TABLE 2: Weightage table for Java

Figure 5: Weightage chart for Java

TABLE 3: Weightage table for Python

Figure 6: Weightage chart for Python
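The weights behind these tables come from the cluster built in Step 3 and the probabilities defined in Step 6. A rough sketch of that pipeline is given below. The pair counting with nC2 combinations and the definitions of P(j|i) and P(j) follow the text. The final score, P(j|i) divided by P(j), is an assumption suggested by the reference to inverse-document scaling, since the paper's own scoring equation is not available here. All names and the sample profiles are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_clusters(profiles):
    """Count skill owners and co-occurring skill pairs across user profiles."""
    skill_freq = Counter()           # Freq(i): users who possess skill i
    clusters = defaultdict(Counter)  # clusters[i][j]: Freq(j, i)
    for skills in profiles:
        unique = sorted(set(skills))
        skill_freq.update(unique)
        # nC2 combinations of the skills that appear together in one profile.
        for a, b in combinations(unique, 2):
            clusters[a][b] += 1
            clusters[b][a] += 1
    return skill_freq, clusters


def conditional_probability(j, i, skill_freq, clusters):
    """P(j|i) = Freq(j, i) / Freq(i)."""
    if skill_freq[i] == 0:
        return 0.0
    return clusters[i][j] / skill_freq[i]


def prior_probability(j, skill_freq, total_users):
    """P(j) = users possessing skill j / total users."""
    return skill_freq[j] / total_users


def rank_skills(i, skill_freq, clusters, total_users):
    """Rank skills related to i in decreasing order of weight."""
    scores = {}
    for j in clusters[i]:
        p_j_given_i = conditional_probability(j, i, skill_freq, clusters)
        p_j = prior_probability(j, skill_freq, total_users)
        # Assumption: the weight is P(j|i) scaled down by the popularity of j.
        scores[j] = p_j_given_i / p_j
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sanitized profiles, one list of skills per user.
profiles = [
    ["java", "sql", "spring"],
    ["java", "sql"],
    ["python", "sql"],
    ["java", "spring"],
]
skill_freq, clusters = build_clusters(profiles)
ranking = rank_skills("java", skill_freq, clusters, len(profiles))
print(ranking)
```

In this toy data "sql" and "spring" co-occur with "java" equally often, but "spring" ranks first because it is rarer overall. The asymmetry mentioned in Step 6 also shows up: P(java|spring) is 1.0 while P(spring|java) is 2/3.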

TABLE 4: ACCURACY TABLE FOR DIFFERENT DIVISION RATIO

          Division Ratio = 0.2    Division Ratio = 0.25
Skill     Pass      Fail          Pass      Fail
Html      0.91      0.09          0.9       0.1
MySQL     0.9       0.1           0.93      0.07
C         0.98      0.02          0.97      0.03
C++       0.94      0.06          0.93      0.07
SQL       0.85      0.15          0.82      0.18

As shown in Table 4, the ratio with which the input dataset is divided into training and testing datasets is defined by the division ratio. A set of recommendations is considered to have passed if any one of the skills recommended by our system is present in the test case.
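The division ratio and the pass criterion can be sketched as below. The split takes the first `div_ratio` fraction of the records as the test set and the remainder as the training set, as the paper describes. Evaluating only test records that contain the input skill, and representing each record as a set of skills, are assumptions made for this illustration. The function names, sample records and recommended skills are hypothetical.

```python
def split_records(records, div_ratio):
    """Return (test_set, train_set) for the given division ratio."""
    cut = int(len(records) * div_ratio)
    return records[:cut], records[cut:]


def pass_rate(test_records, input_skill, recommended):
    """Fraction of relevant test records that contain a recommended skill."""
    # Assumption: only records that contain the input skill are evaluated.
    relevant = [r for r in test_records if input_skill in r]
    if not relevant:
        return None
    passed = sum(1 for r in relevant if any(s in r for s in recommended))
    return passed / len(relevant)


# Hypothetical records, one set of skills per user.
records = [
    {"java", "sql"},
    {"java", "html"},
    {"python", "sql"},
    {"java", "spring"},
    {"c", "c++"},
    {"java", "sql", "spring"},
    {"python", "django"},
    {"html", "css"},
]

test_set, train_set = split_records(records, 0.25)

# Stand-in for the skills the trained model would recommend for "java".
recommended_for_java = ["sql", "spring"]
rate = pass_rate(test_set, "java", recommended_for_java)
print(len(test_set), len(train_set), rate)
```

With a ratio of 0.25, two of the eight records form the test set; one of them contains a recommended skill alongside "java", so half of the relevant records pass.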


For a given division ratio, the testing set is obtained by extracting the respective fraction of the total records from the given input set, and the rest of the records are taken as the training set.

A simple formula based explanation is as follows:

Test Set = InputSet[0 : len(InputSet) * divRatio]
Train Set = InputSet[len(InputSet) * divRatio : len(InputSet)]

The testing proceeds by training the model with the training set obtained and validating the model with respect to the testing set.
For a given input skill, if any one of the skills recommended by the model is present along with the input skill in the test records, the record is taken to have passed; otherwise it fails.

In our testing model, the division set is obtained for two division ratios, viz. 0.2 and 0.25. This was done in order to validate the results with different sizes of training and testing dataset. The results have been tabulated.

VI. CONCLUSION

Similar research [15][16][17] on hybrid job recommendations has been done as a proof of concept; however, this research cannot be directly compared since its design is fundamentally different.

The designed system is able to successfully recommend jobs based on a user's current skill set by combining it with the similar skills in the global data set that we have acquired from the analysis of the whole input set. We conclude that for the Division Ratio 0.2 the average accuracy of the system for all the jobs is 91.33% and for Division Ratio 0.25 the average accuracy of the system for all the jobs is 92.74%.

VII. REFERENCES

[1] B Sarwar et al., "Item-based collaborative filtering recommendation algorithms", pp. 285-295, Jan 2001.
[2] KP Murphy et al., "Naive bayes classifiers", University of British Columbia, 2006.
[3] O Nasraoui et al., "An intelligent web recommendation engine based on fuzzy approximate reasoning", In Fuzzy Systems, 2003. FUZZ'03. The 12th IEEE International Conference on, vol. 2, pp. 1116-1121. IEEE, 2003.
[4] PE Danielsson et al., "Euclidean distance mapping", Computer Graphics and Image Processing 14, no. 3, pp. 227-248, 1980.
[5] Mukund Deshpande et al., "Item-Based Top-N Recommendation Algorithms", ACM Transactions on Information Systems (TOIS) 22, no. 1, pp. 143-177, 2004.
[6] G Salton et al., "On the use of clustered file organization in information search and retrieval".
[7] John S Breese et al., "Empirical analysis of predictive algorithms for collaborative filtering", In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence, pp. 43-52. Morgan Kaufmann Publishers Inc., 1998.
[8] R. M. Bell and Y. Koren, "Lessons from the Netflix prize challenge", ACM SIGKDD Explorations Newsletter, vol. 9, no. 2, p. 75, Dec. 2007.
[9] M. Papagelis and D. Plexousakis, "Qualitative analysis of user-based and item-based prediction algorithms for recommendation agents", Engineering Applications of Artificial Intelligence, vol. 18, no. 7, pp. 781-789, Oct. 2005.
[10] Fu, Xiaobin, Jay Budzik, and Kristian J. Hammond. "Mining navigation history for recommendation." In Proceedings of the 5th international conference on Intelligent user interfaces, pp. 106-112. ACM, 2000.
[11] G. Linden, B. Smith, and J. York, "Amazon.com recommendations: Item-to-item collaborative filtering," IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, Jan. 2003.
[12] Zhou, Yunhong, Dennis Wilkinson, Robert Schreiber, and Rong Pan. "Large-scale parallel collaborative filtering for the netflix prize." In Algorithmic Aspects in Information and Management, pp. 337-348. Springer Berlin Heidelberg, 2008.
[13] M. Pazzani and D. Billsus, "Learning and revising User Profiles: The identification of Interesting Web Sites," Machine Learning, Arlington, 27, pp. 313-331, 1997.
[14] B. Mobasher, H. Dai, T. Luo, and M. Nakagawa, "Effective personalization based on association rule discovery from Web usage data," ACM Workshop on Web information and data management, Atlanta, GA, Nov. 2001.
[15] Lu, Yao, Sandy El Helou, and Denis Gillet. "A recommender system for job seeking and recruiting website." In Proceedings of the 22nd International Conference on World Wide Web, pp. 963-966. ACM, 2013.
[16] Burke, Robin. "Hybrid recommender systems: Survey and experiments." User modeling and user-adapted interaction 12, no. 4 (2002): 331-370.
[17] Bakar, Azuraini Abu, and Choo-Yee Ting. "Soft skills recommendation systems for IT jobs: A Bayesian network approach." In Data Mining and Optimization (DMO), 2011 3rd Conference on, pp. 82-87. IEEE, 2011.
