Research On Resume Classification System Based On Graph Neural Network
Abstract—In this paper, we investigate an innovative resume classification method that utilizes heterogeneous graph neural networks (HGNN) to construct resumes, jobs, and job categories into a heterogeneous graph, aiming to improve the accuracy and efficiency of resume classification. In this system, the resume is regarded as the target node, and a heterogeneous graph is constructed by connecting it with the nodes of related jobs and job categories. A graph neural network model performs feature aggregation and classification, which lets the model capture the complex relationships between nodes and classify resumes accurately. The experimental results show that the proposed method is significantly superior to traditional methods in classification performance, and it provides a new idea and technical support for the design of resume classification systems.

Keywords—Resume classification, Heterogeneous mapping, Deep learning, Graph neural network

I. INTRODUCTION

Nowadays, with the rapid development of information technology, enterprises are facing increasingly severe challenges in the recruitment process. With the explosion in the number of job seekers, enterprises need to screen out suitable candidates efficiently and accurately; however, traditional resume classification methods struggle to meet this demand. Existing methods often rely on manual feature engineering and rule-based models, which do not perform well on complex relationships and high-dimensional sparse data. Specifically, resumes are diverse in format and content, which makes them difficult to process automatically; resume information involves a variety of entities and relationships, such as educational background, work experience, skills and project experience, with complex interactions and dependencies among them; resume data is usually high-dimensional and sparse, which is difficult to process effectively with traditional machine learning methods [1]; and the information in a resume is highly semantic, so understanding and handling that semantic information is also a major problem [2]. Therefore, how to use advanced technology to improve the accuracy and efficiency of resume classification has become an urgent problem to be solved.

Fig. 1. The basic connection structure in the constructed heterogeneous graph

This paper proposes an innovative resume classification method that uses a Heterogeneous Graph Neural Network (HGNN) to construct resumes, jobs and job categories into a heterogeneous graph to improve the accuracy and efficiency of resume classification. In this system, the resume is treated as the target node, which forms a complete heterogeneous graph structure by connecting with related job and job category nodes; the connection relationships are shown in Figure 1. First, three types of nodes are defined, namely resume nodes, job nodes and job category nodes, which represent the resume of the applicant, the specific job position and the job category, respectively. Then, edges are defined to represent the relationships between nodes. For example, an edge between a resume node and a job node indicates that the job seeker has held the job, an edge between a resume node and a job category node indicates that the job seeker meets the requirements of the job category, and an edge between a job node and a job category node indicates that the job belongs to that category. After the heterogeneous graph is built, the HGNN model is used to aggregate and classify the features of the nodes in the graph. The feature vector of each node is initialized first. The features of a resume node include the job seeker's education background, work experience, skills and other information, and the features of job nodes and job category nodes include job descriptions, category characteristics and other information. The HGNN model is used to aggregate the features
Authorized licensed use limited to: Dayananda Sagar University. Downloaded on March 13,2025 at 05:08:35 UTC from IEEE Xplore. Restrictions apply.
of each node, taking into account the relationships between nodes and the influence of different types of nodes and edges, so as to carry out the classification task on resume nodes and assign resumes to the corresponding job categories.

In order to verify the effectiveness of the proposed method, a large number of experiments have been carried out. The experimental results show that the resume classification system based on HGNN is significantly superior to traditional methods in classification accuracy and efficiency. Specifically: first, the classification accuracy of the proposed method is significantly improved, and resumes are assigned to the correct job categories more accurately; second, the HGNN model effectively captures the complex relationships between nodes and is particularly good at processing resume data involving multiple entities and relationships; finally, the proposed method is efficient when dealing with a large amount of resume data and can complete the classification task quickly.

The main contributions of this paper include:

1. A heterogeneous graph network is built that makes full use of the relationship information between nodes; the HGNN model is introduced for feature aggregation and classification, and the complex relationships between nodes are captured effectively.

2. The effectiveness of the proposed method is verified by a large number of experiments, and the results show that the proposed method is significantly superior to the traditional method in classification performance.

Future work can further optimize the model structure, explore more types of node and edge relationships, and apply the method to larger data sets. Combining other advanced deep learning techniques with HGNN may further improve the performance of the resume classification system.

II. RESEARCH STATUS

There has been some work on how to quickly identify information in resumes and classify them accurately. In 2018, Zhang Bo [3] introduced a novel resume analysis method based on a domain knowledge base. This method first uses trigger word matching and word2vec word vectors to perform an initial segmentation of the resume, subdividing it into several thematic blocks. A conditional random field model is then used to label the entities in these blocks, and this series of operations achieved positive practical results. Another line of work is rule-based information extraction, which focuses on direct matching over the full text and thereby avoids the error accumulation that resume segmentation may introduce. Kopparapu, Sunil Kumar et al. [4] designed a resume information extraction system based on this idea. The system implements a two-step strategy: the first step builds a customized extraction rule set according to the characteristics of different parts of the resume; the second step applies these rules to every section of the resume, performing rigorous pattern matching to capture the required information efficiently and accurately.

Named entity recognition (NER) techniques involve identifying and labeling specific entities or categories of events in text [5,6]. When applied to resume information extraction, this technique processes the input resume text through deep learning sequence models, such as the long short-term memory network (LSTM) [7]. An LSTM can assign a label to each word in the sentence in sequence, so as to identify and classify entity and event information and achieve fine-grained extraction and classification. For example, Akihiro Katsuta and other researchers treat resume analysis as an NER task, comprehensively annotate all parts of the resume text, and effectively extract labels and their associated content. In addition, some scholars further optimized the method by combining bidirectional LSTM (BiLSTM) and conditional random field (CRF) models and integrating the attention mechanism [8,9]. This design helps the model escape local optima and significantly improves performance. Building on the fused BiLSTM-CRF architecture, another study integrated convolutional neural network (CNN) features [10], and comparative experiments demonstrated the superior performance of this BiLSTM-CNNs-CRF model on resume entity annotation tasks. This demonstrates the great potential of deep learning for improving the accuracy of resume information extraction.

Text classification is an important problem in natural language processing (NLP). From the concept of text classification in the 1960s to the mature application of text classification algorithms in 2020, the field has gone through many stages of development. In the early stage, shallow learning models based on statistics, such as K-nearest neighbor (KNN), naive Bayes (NB) and support vector machine (SVM), were widely used in text classification tasks [11]. Since 2010, research on deep learning models for text classification has gradually emerged. These models range from simple to complex and include multi-layer perceptrons (MLPs), text convolutional neural networks (TextCNN), recurrent neural networks (RNNs), and others. In 2017, the multi-head self-attention mechanism and the Transformer model brought new ideas to text classification. The pretrained BERT (Bidirectional Encoder Representations from Transformers) model, which builds on these techniques and has a very large number of parameters, has become a research hotspot in text classification. In previous studies, traditional classification algorithms such as SVM were usually used for resume information extraction, whereas deep learning classification models have seen relatively little application in this task. At present, resume information extraction is developing towards the combination of neural networks and rules. This approach first divides the resume content into blocks according to rules, uses pre-trained machine learning or deep learning multi-class models to predict the category of each block, and then applies a category-specific method to extract the field information in each block. Regular expressions and statistical models can be used to extract information within blocks; alternatively, extraction can be treated as an NER task handled by an LSTM model combined with a CRF [12-13].
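As a rough illustration of the rules-plus-extraction pipeline described in this section, the following Python sketch splits a resume into blocks using heading rules and then applies a regular expression inside one block. The heading table, the `split_blocks` and `extract_email` helpers, and the sample text are all hypothetical and are not taken from any of the cited systems; a real system would replace the keyword lookup with a trained block classifier.

```python
import re

# Hypothetical heading keywords mapped to block categories (an assumption for
# illustration; the cited systems define their own rule sets).
HEADINGS = {
    "education": "education",
    "work experience": "experience",
    "skills": "skills",
}

# Simple e-mail pattern used as an example of a block-level extraction rule.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def split_blocks(text):
    """Split resume text into {category: [lines]} using heading rules."""
    blocks = {"basic": []}  # lines that appear before the first heading
    current = "basic"
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key = line.lower().rstrip(":")
        if key in HEADINGS:
            current = HEADINGS[key]
            blocks.setdefault(current, [])
        else:
            blocks[current].append(line)
    return blocks


def extract_email(lines):
    """Return the first e-mail address found in a block, or None."""
    for line in lines:
        match = EMAIL_RE.search(line)
        if match:
            return match.group(0)
    return None


SAMPLE = """Jane Doe
jane.doe@example.com
Education:
B.Sc. Computer Science, 2019
Work Experience:
Data analyst, 2019-2023
Skills:
Python, SQL
"""

blocks = split_blocks(SAMPLE)
email = extract_email(blocks["basic"])
```

Keeping block segmentation separate from field extraction mirrors the two-stage design discussed above: the segmentation rule can be swapped for a classifier without touching the per-block extractors.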
III. MODEL

In order to avoid the deficiency of metapaths and extract node features with sufficient semantic and heterogeneous information, a new model, FEMG, is proposed, which consists of node feature transformation and metagraph feature aggregation.

Specifically, due to the heterogeneity of nodes, the features of different node types are distributed in different feature spaces. We first project the features of different nodes into the same feature space through feature transformation. In the pre-processing stage, the heterogeneous graph is divided into different metagraph regions, and feature aggregation is then carried out within each region to obtain the feature representations of the different regions. Finally, the features of the different regions are fused to generate the metagraph features of nodes and embed them. The specific process is shown in Figure 2.

Fig. 2. FEMG frame diagram

A. Node feature transformation

Due to the heterogeneity of nodes, nodes of different types are often distributed in different feature spaces. To make nodes of different types easier to operate on, we set a type-specific transformation matrix that projects different nodes into the same feature space:

$$H = W_{\mathcal{A}} X \tag{1}$$

where $W_{\mathcal{A}} \in \mathbb{R}^{d \times d_{\mathcal{A}}}$ is the feature transformation matrix corresponding to type $\mathcal{A}$, $d_{\mathcal{A}}$ is the original feature dimension of type-$\mathcal{A}$ nodes, and $d$ is the dimension after feature projection, shared by all node types.

Since all nodes are in the same feature space after feature transformation, we can perform feature aggregation on adjacent nodes according to the following formula:

$$H_{mg,i}^{(1)} = D^{-1} A_{G_i} X W^{(1)} \tag{2}$$

where $H_{mg,i}^{(1)}$ is the node feature representation obtained after the first aggregation in the metagraph, $A_{G_i}$ is the adjacency matrix of the metagraph $G_i$, $D^{-1}$ is the inverse of the degree matrix of $A_{G_i}$, $X$ is the node feature after L2 normalization, and $W^{(1)}$ is a learnable weight matrix.

In this way, we learn from the neighbor nodes directly adjacent to the target node. However, 1-hop neighbor information often cannot meet the needs of heterogeneous graphs, because nodes of the same type are often not adjacent. Therefore, we can deepen the above aggregation to obtain higher-order neighbor information. Specifically, the node features after stacking $l$ layers can be expressed as:

$$H_{mg,i}^{(l)} = D^{-1} A_{G_i} H_{mg,i}^{(l-1)} W^{(l)} \tag{3}$$

After stacking $l$ layers, we obtain the higher-order feature representation in the metagraph. However, due to the over-smoothing problem of graph neural networks, node features become similar after multi-layer aggregation and are hard to distinguish. Therefore, we adopt a residual connection to retain the original features of nodes and reduce the effect of over-smoothing. The node features of layer $l$ can then be expressed as:

$$H_{mg,i}^{(l)} = D^{-1} A_{G_i} H_{mg,i}^{(l-1)} W^{(l)} + H_{mg,i}^{(l-1)} \tag{4}$$

In this way, we obtain the node feature representation in the metagraph.

C. Intermetagraph feature aggregation

After obtaining the feature representations $H_{mg,1}^{(l)}, \ldots, H_{mg,s}^{(l)}$ of different metagraphs, we use the attention mechanism to fuse them.
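To make the data flow of the model concrete, the following NumPy sketch walks through a type-specific projection in the spirit of Eq. (1), the degree-normalized aggregation with a residual connection in the spirit of Eqs. (2)-(4), and a softmax-weighted fusion across metagraphs. It is a minimal sketch under stated assumptions rather than the authors' implementation: the function names, the toy four-node graph, the random weights, and in particular the raw attention `scores` are hypothetical placeholders, because the text available here does not specify how the scores are computed. The sketch also assumes that the residual term is applied from the second layer onward.

```python
import numpy as np


def project(x, w_type):
    """Type-specific projection: (n, d_A) features into the shared (n, d) space.

    Rows of `x` are nodes, so the column form H = W_A X becomes X @ W_A^T.
    """
    return x @ w_type.T


def l2_normalize(h):
    """Scale every node feature row to unit L2 norm."""
    return h / np.linalg.norm(h, axis=1, keepdims=True)


def aggregate(adj, h0, weights):
    """Degree-normalized aggregation over one metagraph with residuals.

    adj     : (n, n) adjacency matrix of the metagraph
    h0      : (n, d) projected, L2-normalized node features
    weights : list of (d, d) matrices, one per layer
    """
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                 # assumption: isolated nodes get a zero message
    norm_adj = adj / deg                # D^{-1} A
    h = norm_adj @ h0 @ weights[0]      # first layer, no residual
    for w in weights[1:]:
        h = norm_adj @ h @ w + h        # later layers keep the previous features
    return h


def fuse(reps, scores):
    """Softmax the per-metagraph scores and return the weighted sum."""
    scores = np.asarray(scores, dtype=float)
    att = np.exp(scores - scores.max())  # shift for numerical stability
    att = att / att.sum()
    fused = sum(a * r for a, r in zip(att, reps))
    return fused, att


rng = np.random.default_rng(0)
d = 4  # shared feature dimension (hypothetical)

# Toy graph: nodes 0-1 are resumes, node 2 is a job, node 3 is a job category.
x_resume = rng.normal(size=(2, 5))
x_job = rng.normal(size=(1, 3))
x_cat = rng.normal(size=(1, 2))
w_resume = rng.normal(size=(d, 5))
w_job = rng.normal(size=(d, 3))
w_cat = rng.normal(size=(d, 2))

h0 = l2_normalize(np.vstack([
    project(x_resume, w_resume),
    project(x_job, w_job),
    project(x_cat, w_cat),
]))

# Two hypothetical metagraph adjacency matrices over the same four nodes.
adj_a = np.array([[0, 0, 1, 0],
                  [0, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
adj_b = np.array([[0, 0, 0, 1],
                  [0, 0, 0, 1],
                  [0, 0, 0, 1],
                  [1, 1, 1, 0]], dtype=float)

weights = [rng.normal(size=(d, d)) * 0.5 for _ in range(2)]
reps = [aggregate(a, h0, weights) for a in (adj_a, adj_b)]
fused, att = fuse(reps, scores=[0.2, -0.1])  # placeholder scores
```

Because the residual term adds the previous layer's features back in, the layer weight matrices must be square (`d` by `d`), which is why the projection to a shared dimension happens first.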
The attention weight of each metagraph region is obtained by softmax normalization:

$$att_i = \frac{\exp(e_i)}{\sum_{j=1}^{s} \exp(e_j)} \tag{6}$$

where $e_i$ denotes the unnormalized attention score of the $i$-th metagraph region and $s$ is the number of regions.

After obtaining the attention values of different regions, we can calculate the final feature embedding of the node over its metagraph regions:

$$H_{mg} = \sum_{i=1}^{s} att_i \cdot H_{mg,i}^{(l)} \tag{7}$$

IV. EXPERIMENT

A. Data set information

The dataset used in this experiment is a comprehensive dataset of resumes, jobs, and job categories derived from public job boards and internal company recruitment systems, covering a wide range of industries and job categories. Resume data is the core part of the experiment and includes the basic information of job seekers, their educational background, and so on; after cleaning and preprocessing, word2vec is used to generate feature vectors to ensure the quality and consistency of the data. The job data contains detailed job information and is also rigorously preprocessed. Job category data represents job classification information, such as industry category and job category, which helps to build a more comprehensive heterogeneous graph structure. When constructing the heterogeneous graph, resumes, jobs and job categories are regarded as three different types of nodes, and a complete heterogeneous graph structure is formed by defining the relationships between resume and job, resume and job category, and job and job category.

B. Baseline

HAN: This model was the first to apply attention mechanisms to heterogeneous graphs, transforming the heterogeneous graph into multiple homogeneous graphs. It generates node embeddings through hierarchical aggregation.

ie-HGCN: This model utilizes specific type-level attention mechanisms to effectively discover meta-paths in heterogeneous graphs. It aggregates meta-paths to obtain the final embeddings.

ROHE: This model is a robust heterogeneous graph neural network that uses attention purifiers to aggregate the top T most important neighboring nodes. It prunes malicious neighbors based on both topology and features, enhancing the model's robustness.

C. Classification result

The models were compared on the resume classification task under different data division ratios. As the proportion of training data increases, the accuracy of all models improves to varying degrees. Specifically, the accuracy of HAN is 90.45% at the 20% data division and rises to 91.45% as the proportion increases to 80%. The accuracy of ie-HGCN was 90.92% at the 20% division and 92.26% at the 80% division, which was superior to HAN in almost every data division ratio. The accuracy of ROHE was 90.57% at the 20% division and increased to 92.06% at the 80% division, which was slightly better than HAN but inferior to ie-HGCN. FEMG performed best at all data division ratios, reaching 92.06% accuracy at the 20% division and 93.34% at the 80% division, which was significantly higher than the other models. Overall, FEMG performed best in the resume classification task, and its classification accuracy was ahead of the other models regardless of the data division ratio, indicating significant advantages in feature enhancement and heterogeneous graph processing. The improvement of all models with a larger proportion of training data further confirms the importance of training data volume for model performance.

D. Clustering experiment

TABLE II. EXPERIMENTAL RESULTS OF CLUSTERING OF DIFFERENT MODELS

Model      NMI      ARI
HAN        0.6974   0.7172
ie-HGCN    0.4905   0.3447
ROHE       0.6879   0.7206
FEMG       0.7004   0.7312

Table II shows the clustering results of the compared models. It can be seen from the table that FEMG has the highest scores on both the NMI and ARI indexes, 0.7004 and 0.7312 respectively. HAN follows with 0.6974 and 0.7172, slightly worse than FEMG. ROHE is slightly worse than FEMG and HAN on NMI (0.6879) but comparable to HAN on ARI (0.7206 against 0.7172). The ie-HGCN model has the worst performance on both indicators, 0.4905 and 0.3447 respectively. In summary, FEMG achieves the highest NMI and ARI, indicating that it is superior to the other models in clustering accuracy and in the consistency of clustering results.
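For readers unfamiliar with the two clustering metrics, the sketch below computes the adjusted Rand index (ARI) and normalized mutual information (NMI) from two label lists using only the Python standard library. The paper does not state which NMI normalization or which tooling was used, so the arithmetic-mean normalization here is an assumption, and the toy label lists are hypothetical. Both functions expect at least two samples.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """ARI: pair-counting agreement between two partitions, corrected for chance."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(labels_true, labels_pred):
    """NMI with arithmetic-mean normalization (an assumed variant)."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    mi = 0.0
    for (t, p), c in Counter(zip(labels_true, labels_pred)).items():
        mi += (c / n) * log(c * n / (count_true[t] * count_pred[p]))
    mean_entropy = 0.5 * (entropy(labels_true) + entropy(labels_pred))
    return mi / mean_entropy if mean_entropy > 0 else 1.0


# Hypothetical toy labels: a relabelled perfect clustering and an uninformative one.
truth = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]
uninformative = [0, 1, 0, 1]

ari_perfect = adjusted_rand_index(truth, perfect)
nmi_perfect = normalized_mutual_info(truth, perfect)
ari_bad = adjusted_rand_index(truth, uninformative)
nmi_bad = normalized_mutual_info(truth, uninformative)
```

Both metrics are invariant to how cluster ids are numbered, which is why the relabelled clustering scores as a perfect match. ARI can fall below zero when agreement is worse than chance, whereas NMI is bounded below by zero.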
Further, we examined a visualization of the model's results, shown in Figure 3. It can be seen that FEMG distinguishes nodes of different categories well: nodes within a class are tightly grouped and the separation between classes is clear, showing good experimental results.

V. CONCLUSION

In this paper, an innovative resume classification method based on heterogeneous graph neural networks (HGNN) is proposed, and its effectiveness is verified in experiments. In this method, resumes, jobs and job categories are constructed into a heterogeneous graph, and a graph neural network model performs feature aggregation and classification, so as to improve the accuracy and efficiency of resume classification. The experimental results show that the proposed method is significantly superior to the traditional method in classification performance and classifies resumes accurately by making full use of the complex relationships between nodes. In particular, the classification accuracy and clustering performance of the proposed method are superior to the traditional method under different data partition ratios, and its NMI and ARI scores are the highest among the compared models. Therefore, this study provides new ideas and technical support for the design of resume classification systems, and provides a more efficient and accurate candidate screening method for the recruitment process of enterprises. Future research can further optimize the model structure and algorithm, and explore more complex relationship modeling methods to further improve the performance and practicability of the resume classification system.

REFERENCES

[1] Chen M, Huang C, Xia L, et al. Heterogeneous graph contrastive learning for recommendation[C]//Proceedings of the sixteenth ACM international conference on web search and data mining. 2023: 544-552.
[2] Yang X, Yan M, Pan S, et al. Simple and efficient heterogeneous graph neural network[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2023, 37(9): 10816-10824.
[3] Katsuta A, Hanjaya H A, Asati S, et al. Information extraction from english & japanese résumé with neural sequence labelling methods[J]. 2018.
[4] Veličković P. Everything is connected: Graph neural networks[J]. Current Opinion in Structural Biology, 2023, 79: 102538.
[5] Zhonghe He, Zhongcheng Zhou, Liang Gan, et al. Chinese entity attributes extraction based on bidirectional lstm networks. International Journal of Computational Science and Engineering, 2019. 18(1): 65-71.
[6] Li Wei. Research and implementation of Chinese resume analysis system based on deep neural network. [Master's thesis]. Chongqing: Chongqing University of Posts and Telecommunications, 2018.
[7] Chen Yi. Application of deep learning in resume analysis. [Master's thesis]. Chongqing: Chongqing University of Posts and Telecommunications, 2019.
[8] Zu Shicheng, Wang Xiulai, Cao Yang, et al. Resume analysis based on new text block segmentation method. Computer Science, 2020. 47(6A): 95-101.
[9] Shao Y, Li H, Gu X, et al. Distributed graph neural network training: A survey[J]. ACM Computing Surveys, 2024, 56(8): 1-39.
[10] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101, 2016.
[11] Dwivedi V P, Joshi C K, Luu A T, et al. Benchmarking graph neural networks[J]. Journal of Machine Learning Research, 2023, 24(43): 1-48.
[12] Tsitsulin A, Palowitch J, Perozzi B, et al. Graph clustering with graph neural networks[J]. Journal of Machine Learning Research, 2023, 24(127): 1-21.
[13] Rusch T K, Bronstein M M, Mishra S. A survey on oversmoothing in graph neural networks[J]. arXiv preprint arXiv:2303.10993, 2023.
[14] Zhang M, Wang X, Zhu M, et al. Robust heterogeneous graph neural networks against adversarial attacks[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2022, 36(4): 4363-4370.