
Automatic Learning Path Recommendation For Open Source Projects Using Deep Learning On Knowledge Graphs

This paper presents a method for automatically recommending learning paths for developers contributing to open source projects by utilizing deep learning and knowledge graphs. The approach constructs knowledge graphs from various data sources within open source communities and employs algorithms to suggest relevant learning paths based on developers' contribution goals. Experimental results indicate that this method significantly reduces the time needed for developers to understand project code while maintaining accuracy in the recommended paths.


2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC)

Automatic Learning Path Recommendation for Open Source Projects Using Deep Learning on Knowledge Graphs

978-1-6654-2463-9/21/$31.00 ©2021 IEEE | DOI: 10.1109/COMPSAC51774.2021.00115

Hang Yin, Zhiyu Sun, Yanchun Sun*, Gang Huang
Key Laboratory of High Confidence Software Technologies, Ministry of Education
Department of Computer Science and Technology
Peking University
Beijing, China
{yhang1996, sunzypku, sunyc, hg}@pku.edu.cn

* Corresponding author.

Abstract: Open source is an important way for developers to collaborate on software development, and more and more developers are beginning to contribute to open source projects. When a developer begins to contribute to an existing open source project, the first thing to do is to read and understand the project code. However, most current open source projects only provide API documentation, not project design documents for new developers. Developers can only understand the code based on scattered comments in the code, which is difficult for newcomers. Therefore, developers need to find a learning path that helps them understand the project and finish their contribution tasks quickly. To help developers find a learning path easily and quickly, this paper puts forward a method to automatically recommend learning paths for open source projects. It uses multiple data sources in an open source community to extract knowledge data and build knowledge graphs for open source projects. Then, based on a deep-learning-based knowledge graph embedding model and a path recommendation algorithm, the method recommends proper learning paths for developers. We select three well-known open source projects, Lua, Memcached and TensorFlow, according to language, scope and community activity, as cases to verify our method, and conduct comparative experiments between the learning paths found by real developers and those recommended by the method. Experiment results show that our method saves developers a lot of time while ensuring the accuracy of the recommended learning path.

Keywords: Learning path recommendation, Open source project analysis, Deep learning, Knowledge graph

I. INTRODUCTION

Open source means that multiple developers develop together through freely available source code. It is an important way for developers to collaborate on software development. Open source projects can not only improve the robustness and safety of the software, but also help participating developers improve their development abilities. With the rapid development of open source communities in recent years, popular open source projects are not only used by individual developers, but also introduced by enterprises as part of their technology stacks. Since software analysis and design documents are seldom available in open source communities, it is a huge challenge for new developers who want to contribute to open source projects to learn and get familiar with the source code.

When developers are trying to commit a contribution to an existing open source project, the first thing they need to do is to read and understand the project code according to their contribution goals. However, it is difficult for developers to directly find the source code related to their goals. The main features of current open source communities (such as GitHub [1]) focus on version control and project management, such as discussing issues and reviewing pull requests. These features do not include the project code knowledge management needed by new developers. At the same time, most open source projects only maintain user documentation and lack documentation for developers. New developers can only learn the project code by reading it gradually.

Functions are the core part of the source code. Most of the source code that developers would like to contribute to is related to specific functions. Therefore, developers always need to learn and understand specific functions. However, locating and reading a function is not enough, because functions are always embedded in a function call path to implement a feature. Usually the developer needs to start from the outermost layer of the program, gradually read in depth along the function calls to the target function, and learn the entire function and its position in the program through the call path. This function call path that the developer goes through during the learning process can be considered a learning path for the open source project. For example, the developer can start from a relevant function in the unit test, go deeper into the specific function tested, and understand the details of the operation of the function called by the test case. The quality of the learning path matters while the developer follows it. To help developers learn the project more quickly, a proper learning path should not contain functions that are not highly related to the main functionality of the project. Recommending proper learning paths is therefore a critical task. But so far, there is no good enough solution to help developers find proper learning paths.

In order to solve the problem above, we propose a learning path recommendation method for open source projects to help developers more easily understand the knowledge they need for development, so as to commit contributions more quickly. The method is composed of three parts. Firstly, the method analyzes and constructs the knowledge graph of an open source project based on information including the code, documents, discussions and comments of the project. Secondly, the method

fuses multiple knowledge graphs for different versions of the project and builds the knowledge graph embedding vectors using a deep learning model, and further calculates the knowledge graph node similarity as the weight of the path in the graph. Finally, the method automatically recommends the learning path according to developers' contribution goals. The method proposed in this paper can help developers learn open source projects more quickly, and promote the contribution activity of developers in open source communities.

The main contributions of this paper are as follows.

1. First, this paper proposes a method for representing multi-dimensional knowledge of an open source project by constructing a knowledge graph. The method automatically extracts program structure information from the open source project code using static code analysis, combines it with multi-source information from multiple versions of the project (such as issues, commits, pull requests, etc.), and generates graph embedding vectors using a deep learning model. As far as we know, our knowledge graph is the first to combine multi-source information from multiple versions of an open source project.
2. Second, this paper proposes a method for recommending learning paths for open source projects based on the generated knowledge graph. Based on the Dijkstra algorithm and the depth-first search algorithm, the method helps developers learn the functions related to their contribution goals more quickly. To the best of our knowledge, our method is the first to help new developers learn open source projects by recommending learning paths based on multi-version knowledge graphs.
3. Third, this paper selects three well-known open source projects according to language, scope and community activity to experiment on the method, and conducts comparative experiments with developers of different development backgrounds to verify the feasibility and the effectiveness of the method. The experiment results show that our method can save developers a lot of learning time.

The rest of the paper is organized as follows. Section II introduces related work. Section III presents our method. Section IV discusses the experiments and results. Finally, Section V concludes this paper.

II. RELATED WORK

This section mainly introduces the existing literature in the fields of source code analysis, open source project analysis and knowledge graph analysis.

A. Source Code Analysis

As developers increasingly use and contribute source code in open source communities, a tremendous amount of code and programming data has been accumulated. In recent years, analyzing source code and exploring its potential value to assist project development has attracted a wide range of research interests.

The traditional source code analysis methods can be divided into dynamic analysis and static analysis. Dynamic analysis methods need to execute programs and analyze the source code according to information about program operations. For example, Anderson et al. propose a malware detection algorithm based on the analysis of graphs constructed from dynamically collected instruction traces of the target program [2]. Cohen et al. present a novel approach for trusted detection of ransomware in virtual servers on an organization's private cloud [3]. In contrast, static analysis methods analyze the source code itself directly, which is more convenient and faster than dynamic analysis. For example, Thomé et al. propose an integrated approach for injection vulnerability detection [4]. They use static analysis to extract minimal program slices relevant to the security of Web programs and to generate attack conditions, and then apply hybrid constraint solving to determine the satisfiability of the attack conditions to further detect vulnerabilities. Although dynamic analysis can identify various information in the code more accurately, we choose to use static analysis. For the scenario in which developers learn open source projects, running the projects is usually more complicated than analyzing source code statically, and static analysis also produces results more quickly.

From the perspective of static analysis granularity, most studies have focused on statement granularity and function granularity. Studies that analyze source code at statement granularity usually first extract the syntax tree of the code, and then process and analyze the syntax tree. Habib et al. propose a method to automatically classify classes as thread-safe or thread-unsafe [5]. They combine a lightweight static analysis method with a graph-based classifier to extract syntax trees and call graphs for the classes, then use an SVM (Support Vector Machine) to perform the classification. Tu et al. propose a context-aware approach to assisting fault comprehension and identification [6]. Built on risk assessment results, this approach searches for faults in a weighted call graph generated by static analysis. In addition, the function call relationships in a program often contain a lot of semantic information, and developers can better understand the relationship between a program's overall design logic and its functional modules through the call graph of the program. There are also many related studies focusing on program function call graphs. Ali et al. show that analyzing compiled JVM bytecode works well for generating call graphs of programs written in JVM-hosted languages such as Groovy, Scala and OCaml [7]. Gharibi et al. present a tool called Code2graph for generating call graphs and execution path similarity matrices for Python, which can be used as the basis for further deep learning training [8].

On the other side, at present, the program analysis approaches of function granularity mostly focus on the analysis

of the program itself, such as risk analysis and defect analysis. Gascon et al. propose an embedding method based on call graphs and use machine learning to extract a structure graph to identify the code structure of malware [9]. Trapp et al. divide the program into different parts and permissions according to functions, based on static analysis results and call graphs, to improve the security of the program [10].

The studies mentioned above on source code analysis mainly focus on the use of the source code itself, and the granularity of analysis is usually at the code segment level. These studies mainly use source code analysis to find errors or to find the code logic for specific functions. However, for an open source project, making it easier for developers to learn and use is more important. In particular, making it easier for developers in an open source community to participate and contribute source code is key to maintaining the long-term development of the open source project. At present, there are relatively few studies focusing on developers learning open source projects.

B. Open Source Project Analysis

With the growth of an open source community, the community itself accumulates a wealth of development data. By analyzing and mining the development data, researchers can help developers learn and develop projects better. At present, there are many studies on how to better use the information in open source projects to assist developers. Many studies view source code in an open source community as a document used for searching. They assist developers in searching source code in open source projects. Zou et al. propose a novel approach based on graph embedding to search source code in an open source project [11]. They extract properties and definitions in the source code to build a code graph and use the embedding vector of the code graph for code search. CodeHow is an approach to recognizing potential APIs and understanding the potentially relevant APIs [12]. It expands queries with the APIs and performs code retrieval by applying the Extended Boolean model, which considers the impact of both text similarity and potential APIs on code search. Lin et al. find that in large-scale open source projects, developers face a gap between the words used in queries and the words used in documents when they want to learn APIs. They propose an approach to improving the search for API learning content which leverages software-specific knowledge [13].

On the other hand, there are many studies on assisting developers in developing projects better from different aspects. The most common scenario is assisting developers in programming. Nowadays, an IDE (Integrated Development Environment) such as IntelliJ can only complete and check developers' code in a limited way [14]. Linn et al. propose an approach to detecting reusable source code in source code projects and helping developers directly fill in the detected code using templates [15]. The experiment results show the efficiency of the approach. SLAMPA is a tool that uses neural networks to write source code snippets [16]. It first uses neural language models to infer developers' programming intent, then retrieves source code snippets from code bases and recommends them to developers.

There are also many other studies focusing on learning-related recommendation for developers, which can improve the learning efficiency of developers. Sun et al. propose a method for building code knowledge graphs based on open source projects and recommending learning paths to developers [17]. Prabhakar et al. propose a recommendation system that matches developers who are potentially interested in each other [18]. The system can improve communication between different developers. Dai et al. propose a supporting system which can recommend a personal learning path for different learning objectives to individual learners [19]. Chang et al. develop a data-driven learning interest recommendation system [20]. Pang et al. propose a new recommendation model with learner neighbors and learning series, called RLNLS [21]. These studies currently mainly focus on assisting smart development and collaborative development of developers based on the open source community. However, there are few studies that help developers learn open source projects. The data sources used in these studies are relatively limited and only contain single-version project information, and the support for developer learning is also very limited.

C. Knowledge Graph Analysis

In recent years, the construction and the application of knowledge graphs have grown rapidly. People have created a large number of knowledge graphs and successfully applied them in many practical applications, such as Freebase [22], DBpedia [23], YAGO [24] and NELL [25].

Open source projects contain a lot of knowledge which is of great help to developers learning open source projects, such as source code, Issues, Commits, etc. This paper uses the knowledge of a specific open source project in an open source community to build a knowledge graph for that project, and further generates an embedding vector for each node in the knowledge graph using a knowledge graph embedding model for later learning path recommendation. This type of embedding model is called a translational distance model. TransE is a representative translational distance model [26]. The model is based on a distributed vector representation of entities and relationships. TransH [27], TransR [28] and TranSparse [29] are a series of improved models based on TransE. In addition, Perozzi et al. propose an embedding approach called DeepWalk, which uses the SkipGram model to predict the embedding vector of a current node through adjacent nodes [30].

In summary, in the field of source code analysis in open source communities, most studies currently focus on analyzing the code itself. The analysis methods mainly use information such as syntax trees and other static analysis results.
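
To make the translational distance idea behind TransE [26] from Section II-C more concrete, the following minimal Python sketch scores a triple by the distance between head + relation and tail, where a smaller score indicates a more plausible triple. The entity names, the two-dimensional vectors and the helper name transe_score are invented for illustration; they do not come from the paper, from OpenKE, or from any trained model.

```python
import math


def transe_score(head, relation, tail):
    """Return the L2 norm of (head + relation - tail).

    In a translational distance model such as TransE, a lower score
    means the triple (head, relation, tail) is considered more plausible.
    """
    diffs = [h + r - t for h, r, t in zip(head, relation, tail)]
    return math.sqrt(sum(d * d for d in diffs))


# Hypothetical toy embeddings (not learned, chosen by hand for the example).
entity_vec = {
    "main": [0.0, 0.0],
    "parse_args": [1.0, 0.0],
    "unrelated_helper": [5.0, 5.0],
}
func_call_vec = [1.0, 0.0]  # assumed embedding of a func_call relation

# main + func_call lands exactly on parse_args, so this triple scores 0.
plausible = transe_score(entity_vec["main"], func_call_vec, entity_vec["parse_args"])
# The same translation lands far from unrelated_helper, so the score is larger.
implausible = transe_score(entity_vec["main"], func_call_vec, entity_vec["unrelated_helper"])

print(plausible, implausible)
```

In a real setting the vectors would be learned by training, and a distance of this kind between function nodes could then serve as an edge weight for path search.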

The studies on learning-related recommendation for developers mostly focus on how to help developers in programming, such as recommending source code snippets, understanding the developer's intentions and building intelligent programming tools. However, for developers, the biggest obstacle preventing them from contributing to open source communities is not how to write code, but how to understand open source projects. It is also important to notice that there is a lot of code-related knowledge in open source communities. Integrating this knowledge and recommending learning paths for open source projects can help developers get familiar with open source projects more effectively.

III. METHOD OVERVIEW

We propose a method which recommends learning paths of open source projects to developers based on knowledge graphs. The details of the method are introduced in the following.

A. Method Framework

Our method recommends a proper learning path for developers to help them better understand the specific functions they want to learn. Fig. 1 shows the framework of the method. The method is composed of three main parts: open source project knowledge graph construction, open source project knowledge graph analysis, and automatic recommendation of learning paths.

Fig. 1. The framework of the method

B. Open Source Project Knowledge Graph Construction

In order to implement the proposed method, we first need to construct the knowledge graph of an open source project. We construct the knowledge graph from multiple data sources, which helps model the relationships between functions better. The graph construction includes two parts: the data collection of an open source project, and the definition and construction of the open source project knowledge graph.

1) Data collection: Open source projects in an open source community have accumulated huge amounts of data. This data contains not only the code and the documents but also a lot of data generated by developers during development processes. Taking the widely used open source community GitHub as an example, the development-related information in its open source projects can mainly be divided into Commits, Issues and Pull Requests. We collect all of the above information to help recommend learning paths.

On the other hand, we also need to extract knowledge from the source code of the open source project. As analyzed in Section II-A, compared with dynamic analysis, static analysis is fast, convenient, and has fewer dependencies, and is more suitable for our method. This paper uses a static code analysis method to extract knowledge from the project code. Using static analysis, we can obtain information about the program code itself from the source code of an open source project, including the functions, files, and classes in the project code, as well as the relationships between them, namely function call and inclusion relationships. We complete the static analysis using Doxygen, which is a popular, multi-language, cross-platform static analysis tool. Doxygen takes the source code of a project as input and extracts static analysis information such as function call relationships, the structure of files, the attributes of functions, etc. Taking the function call relationship as an example, Doxygen analyzes the code for each file and each project module, and generates partial function call relationship subgraphs for each of them. These function call subgraphs are described and output in the Dot language [31], which provides a simple way to describe graphs.

2) The definition and construction of knowledge graphs: In order to build a knowledge graph of an open source project, we must first define the schema of the graph. The schema of the knowledge graph is the specification of the knowledge in the graph. Pre-designing the schema helps standardization, which in turn facilitates the subsequent processing and querying of knowledge. Fig. 2 shows the schema designed for the knowledge graph of an open source project in the method.

Fig. 2. Schema of a knowledge graph

Since this paper is mainly aimed at developers who want to learn and understand open source projects, the entities in the knowledge graph mainly cover knowledge related to development, such as functions, files, commits, issues and pull requests. The relationships between different kinds of entities are also extracted. Table I shows the entities extracted for the knowledge graph of open source projects in this paper and their descriptions. Similarly, Table II shows the relationships and their descriptions. The relationships are displayed in the form of Subject-Predicate-Object (SPO) triples. In an SPO triple, Subject has a Predicate relationship to Object, where Subject is an entity, and Object can be an entity or an attribute.

When developers read and learn the learning materials of open source projects, such as blogs and discussions, they often

encounter the problem that the learning materials and the actual project versions do not match. The common reason for this problem is that the project is updated too fast and the learning materials cannot keep up with the update speed of the versions. Another reason is that a specific scenario requires referring to an old version of the project. This problem makes it harder for developers to learn the project. Therefore, this paper extracts different stable versions of an open source project during data extraction. Open source projects use the Tag function in the version control system to mark the commit of each release in the submission history during development. Therefore, we can roll back the project to a specific version and execute the data extraction process separately to obtain the knowledge graphs of different versions for subsequent analysis.

TABLE I. TYPE OF ENTITIES

Entity          Description
func            representing a function in the code of an open source project
file            representing a file in an open source project
commit          representing a submission record in the submission history of an open source project
issue           representing a collection of questions and comments of an open source project in the open source community
pull request    representing a merging request from an open source project in the open source community

TABLE II. TYPE OF RELATIONS

Relation                           Description
(sub, func_call, obj)              The sub function calls the obj function
(sub, file_contain_func, obj)      The sub file contains the obj function
(sub, commit_change_file, obj)     The sub submission record modified the obj file
(sub, issue_relate_commit, obj)    The sub issue involves the obj submission record
(sub, issue_relate_issue, obj)     The sub issue involves the obj issue
(sub, issue_relate_pr, obj)        The sub issue involves the obj pull request
(sub, pr_relate_commit, obj)       The sub pull request contains the obj commit
(sub, pr_relate_file, obj)         The sub pull request contains the obj file

C. Open Source Project Knowledge Graph Analysis

After the above data extraction and knowledge graph construction of an open source project, we generate original knowledge graphs for different versions of the open source project. The knowledge graphs contain all knowledge triples from different data sources. Next, we need to analyze the open source project knowledge graph and train a graph embedding model based on deep learning. A deep-learning-based graph embedding model embeds every entity into a low-dimensional vector, which is used by the recommendation algorithm we propose later for calculating distances between knowledge graph nodes.

1) Knowledge graph embedding based on deep learning: The knowledge graph is a multi-relationship graph which is composed of entities (nodes) and relationships (different types of edges). Each edge is represented as a triple, namely (subject, relationship, object). Although this is effective in representing structured data, the underlying symbolic nature of such triples often makes knowledge graphs difficult to manipulate.

In order to solve the problem above, researchers have proposed a new research direction: knowledge graph embedding. The key idea is to embed the components of the knowledge graph, that is, to transform entities and relationships into a continuous vector space. This method can simplify operations on the knowledge graph while preserving the original structure of the knowledge graph. The embeddings of these entities and relationships can be further used in various tasks, such as knowledge graph completion [32], relationship extraction [33] [34], entity classification [35] [36] and entity resolution [36] [37]. In our method, the embeddings are used to calculate the semantic distances between functions and help recommend learning paths. One simple way to represent the knowledge graph using vectors is One-Hot vectors. However, the dimension of the One-Hot vectors is too high, and they cannot express the similarity between similar entities or relationships. Therefore, researchers use distributed representations to represent entities and relationships in the knowledge graph. The TransE model is a graph embedding model which can express a triple as its corresponding embedding.

In our method, the graph embedding model TransE is trained to generate an embedded representation of the knowledge graph of the open source project, and an embedding vector is generated for each node. These embedding vectors represent the position of each entity in the embedded space of the knowledge graph of the open source project. The subsequent algorithms are based on the distance between entities calculated in the embedded space, that is, the distance weight of the relationship between the entities. Although the learning path recommended by our method contains only functions, other entities are also embedded to better calculate the semantic distances between each pair of functions. This paper uses the OpenKE framework for model training. OpenKE is an open source framework for knowledge embedding organized by THUNLP based on the TensorFlow toolkit [38] [39] [40]. The OpenKE framework provides a fast and stable toolkit, including the most popular KRL (knowledge representation learning) methods [41].

2) Multi-version knowledge fusion: There are many data sources in an open source community, and a lot of knowledge in the field of source code. With the development and iteration of project code, the expression and data format of the knowledge may not remain consistent. The multi-source knowledge needs to be extracted, disambiguated, and integrated. At the same time, after the program is developed and iterated, the potential knowledge of the community cannot be mapped to the latest version of the code. This makes novice developers unable to map the content of the information to the actual code, and the development experience cannot be shared and saved. For the knowledge from different versions of the same project, entity names may not be aligned. Therefore, we need to integrate the generated knowledge graphs for the multi-version open source project by fusing the knowledge graphs of different versions before training the graph embeddings.

We fuse knowledge graphs of different versions with a heuristic method. Considering that the entities in the knowledge graphs are all programming-related and are extracted directly from open source projects, the semantic deviations among the same knowledge from different versions will not be as large as those in common knowledge graphs. We assign entity nodes that have the same name in different versions of the graphs the same node id, and link knowledge entities with the same id during the fusion of all versions of the knowledge graphs. In this way, old knowledge is linked to new knowledge by fusing different versions of the knowledge graphs. For developers who want to learn the knowledge graph, old knowledge still has value for learning. Therefore, the knowledge graph after fusion contains the knowledge of all versions. Besides, for each knowledge entity and relationship that is unique to a version, we add the 'gVersion' attribute to it to indicate which version of the open source project it comes from.

D. Automatic Recommendation of Learning Path

source code from a proper learning entrance. Therefore, we first need to analyze the entrance for developers to start learning. Since the entrance needs to be at the outermost layer of the project code structure, the in-degree of its node in the knowledge graph should be 0. However, the nodes with an in-degree of 0 may also include some nodes which are not related to the main functionality of the project and cannot reach many other nodes. Using these nodes as the entrance may cause improper learning paths to be recommended. In order to distinguish these nodes, we further divide 0-in-degree nodes according to the characteristics of the entrance. When learning repeatedly, developers usually hope to learn more knowledge through one learning entrance to reduce their learning effort. Besides, considering that the extracted entrance will be taken as the input of the following learning path recommendation algorithm, the more nodes the entrance is able to reach, the more likely we can reuse it while recommending learning paths for other target functions. For the above reasons, the entrance node needs to be able to reach plenty of nodes, and its own reachable node set should be as much as
Algorithm 1 Learning path recommendation algorithm possible. At the same time, the total number of recommended
entrances should not be too large, because too many
Input: entrance set A , target function B , graph G = (V, E)
recommended entrances will result in an impact on the
Output: path (𝐍𝒊 , 𝐍𝒊"𝟏 , ⋯ , 𝐍𝒊"𝒏 ) ∧ 𝑵 ∈ 𝑮
readability of recommendation results. Therefore, to ensure that
1. path_list = [ ]
the number of entrances does not affect readability, we count
2. For entrance in A : and filter out the nodes that meet the two conditions above and
3. path_list.append(dijkstra_path(G, entrance, B)) consider the nodes which rank as high as possible as the
4. End entrance nodes for learning. In this way, the method generates
5. last_dominate_num = 0 a list of entrances used in subsequent algorithms.
6. For cur_path in path_list: 2) Learning path recommendation algorithm: In order to
7. source = cur_path.first help developers understand the project code more easily, we
8. current = dfs_tree(𝐆,source).number_of_nodes() propose a learning path recommendation algorithm for project
9. If current > last_dominate_num: code. Algorithm 1 shows the pseudo-code of the learning path
10. last_dominate_num = current recommendation algorithm.
11. path = cur_path First, we abstract the learning path analysis problem into a
12. End
path search problem from multiple source points to a single
13. End
target point. When developers learn the target function through
the learning path, they can save more learning time if the
14. return path
learning path can involve the main logic of the program and can
After the above steps, we have constructed and further link more functions. Therefore, we need to choose the most
processed the generated knowledge graph of an open source extensive path from multiple reachable paths as the final
project. Then we will implement the learning path learning path. The algorithm first uses the Dijkstra algorithm to
recommendation based on this knowledge graph. The perform a path search from each entrance to the target function.
recommendation method is divided into two parts: learning Then, for each path, the path with the most nodes in the
entrance analysis and learning path recommendation algorithm. coverage tree is selected as the recommended learning path.
Learning path: The learning path here is defined as: When
a developer wants to understand a specific function in an open IV. EXPERIMENTS
source project, he needs to read the function from the outermost This paper uses the open source community GitHub as the
layer of the program call relationship and follow the function data source for analysis, and selects three representative well-
call to go deep into the target function step by step, learn the known open source projects from GitHub: Lua [43],
entire function and its position in the program by the call path. Memcached [44] and TensorFlow [45] as cases for experiments.
In this process, the function call path read by the developer is These three open source projects have different code sizes,
the learning path, namely(N! , N!"# , ⋯ , N!"$ ) ∧ 𝑁 ∈ 𝐺, where different numbers of discussions in the open source community,
N is a function node in the function call graph G. and different language features. They are all well-known and
1) Learning entrance analysis: Developers who want to widely used open source projects. They are commonly
understand a specific function usually need to start reading

829

Authorized licensed use limited to: BEIJING INSTITUTE OF TECHNOLOGY. Downloaded on October 14,2024 at 11:41:37 UTC from IEEE Xplore. Restrictions apply.
recommended for learning among novice developers and open TABLE IV. ENTRANCE ANALYSIS EXPERIMENT RESULTS
source community contributors. Project Memcached Lua TensorFlow

A. Knowledge Graph Experiments Number of entrances 71 39 89

accuracy 84.5% 74.4% 67.4%


TABLE III. EXTRACTION OF TRIPLES OF OPEN SOURCE PROJECTS
Time Cost 0.003s 0.009s 0.725s
Project Memcached Lua TensorFlow
Version 1.6.0 5.4.0 2.2.0 TABLE V. LEARNING PATH RECOMMENDATION EXPERIMENT RESULTS
func_call 1062 2585 212751 Project Memcached Lua TensorFlow
file_contain_func 582 1049 1607 Path reachability Reachable Reachable Reachable
commit_change_file 3480 12614 1932697 Average number of
6.6 7.9 7
nodes
issue_relate_commit 48 \ 1553 accuracy 80% 80% 60%
issue_relate_issue 19 \ 8244 Time Cost 0.023s 0.12s 0.18s
issue_relate_pr 65 \ 2292

pr_relate_commit 946 549 66341

pr_relate_file 1138 308 98156

Time cost 14.76s 23.99s 57.26s
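
The learning entrance analysis and Algorithm 1 of Section III-D, whose results Tables IV and V report, can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the graph is a plain adjacency dictionary whose edge distances are assumed to be precomputed (in the paper they come from the embedding vectors), the cut-off top_k and the helper names reachable, select_entrances and recommend_path are hypothetical, dijkstra_path reuses the name from Algorithm 1 but is written with the standard library, and skipping entrances that cannot reach the target is an added assumption. The function names in the toy graph are made up.

```python
import heapq
from typing import Dict, List, Optional, Set

# Assumed representation: node -> {successor: edge distance}.
Graph = Dict[str, Dict[str, float]]


def reachable(graph: Graph, source: str) -> Set[str]:
    """Nodes reachable from source (source included), by iterative DFS."""
    seen = {source}
    stack = [source]
    while stack:
        node = stack.pop()
        for succ in graph.get(node, {}):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen


def select_entrances(graph: Graph, top_k: int) -> List[str]:
    """Rank zero in-degree nodes by reachable-set size and keep the top_k."""
    has_incoming = {succ for succs in graph.values() for succ in succs}
    candidates = [node for node in graph if node not in has_incoming]
    candidates.sort(key=lambda n: (-len(reachable(graph, n)), n))
    return candidates[:top_k]


def dijkstra_path(graph: Graph, source: str, target: str) -> Optional[List[str]]:
    """Shortest path from source to target, or None if unreachable."""
    dist = {source: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for succ, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(succ, float("inf")):
                dist[succ] = nd
                prev[succ] = node
                heapq.heappush(heap, (nd, succ))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def recommend_path(graph: Graph, entrances: List[str], target: str) -> Optional[List[str]]:
    """Algorithm 1: keep the path whose entrance covers the most nodes."""
    best_path, best_cover = None, 0
    for entrance in entrances:
        path = dijkstra_path(graph, entrance, target)
        if path is None:
            continue  # assumption: unreachable entrances are skipped
        cover = len(reachable(graph, entrance))
        if cover > best_cover:
            best_path, best_cover = path, cover
    return best_path


# Toy call graph with made-up function names and distances.
toy: Graph = {
    "main": {"server_init": 1.0, "event_loop": 1.0},
    "server_init": {"item_lock": 1.0},
    "event_loop": {"process_command": 1.0},
    "process_command": {"item_lock": 1.0},
    "test_helper": {"item_lock": 0.5},
    "item_lock": {},
}
entrances = select_entrances(toy, top_k=2)
print(recommend_path(toy, entrances, "item_lock"))
# prints ['main', 'server_init', 'item_lock']
```

In the toy graph, test_helper is closer to the target but reaches only two nodes, so the path starting at main, which reaches five nodes, is preferred. This mirrors the preference for the most extensive entrance described in Section III-D.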

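
The version fusion heuristic of Section III-C (entities with the same name share one node id, and version-specific entities and relationships carry a 'gVersion' attribute) can be sketched as follows. This is a hedged illustration rather than the authors' code: the triple format, the function name fuse_versions, the version labels and the entity names are all hypothetical, and "unique to one version" is interpreted here as "appears in exactly one of the fused versions".

```python
from typing import Dict, List, Set, Tuple

# Assumed triple format: (head entity name, relation, tail entity name).
Triple = Tuple[str, str, str]


def fuse_versions(graphs: Dict[str, List[Triple]]):
    """Fuse per-version triple lists into one graph keyed by entity name."""
    node_ids: Dict[str, int] = {}
    node_versions: Dict[int, Set[str]] = {}
    edge_versions: Dict[Tuple[int, str, int], Set[str]] = {}

    def node_id(name: str, version: str) -> int:
        # Same name in any version maps to the same node id.
        nid = node_ids.setdefault(name, len(node_ids))
        node_versions.setdefault(nid, set()).add(version)
        return nid

    for version, triples in graphs.items():
        for head, relation, tail in triples:
            edge = (node_id(head, version), relation, node_id(tail, version))
            edge_versions.setdefault(edge, set()).add(version)

    # Entities seen in exactly one version are tagged with gVersion.
    nodes = {}
    for name, nid in node_ids.items():
        attrs = {"name": name}
        if len(node_versions[nid]) == 1:
            attrs["gVersion"] = next(iter(node_versions[nid]))
        nodes[nid] = attrs

    # Relationships seen in exactly one version are tagged the same way.
    edges = []
    for (head, relation, tail), versions in edge_versions.items():
        attrs = {"gVersion": next(iter(versions))} if len(versions) == 1 else {}
        edges.append((head, relation, tail, attrs))
    return nodes, edges


# Two made-up versions that share "main" and "dispatch".
graphs = {
    "v1": [("main", "func_call", "parse_old"), ("parse_old", "func_call", "dispatch")],
    "v2": [("main", "func_call", "parse_new"), ("parse_new", "func_call", "dispatch")],
}
nodes, edges = fuse_versions(graphs)
for attrs in nodes.values():
    print(attrs)
```

After fusion, the old function parse_old stays in the graph and remains linked to main and dispatch, which is how old knowledge is kept reachable from new knowledge.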

We extract the different relationship triples from the latest versions of the three open source projects. Table III shows the extraction results and the running time of the extraction. The Lua project does not enable the Issues feature, so it has no issue-related knowledge.

According to the extraction results, the method in this paper extracts different relationship triples for open source projects of different dimensions. Among them, TensorFlow has the largest amount of project code and involves implementations in multiple languages, so analyzing and extracting its implementation takes a lot of time. When using the method of this paper to recommend learning paths, the project code can be cut according to the developer’s requirements to reduce the overhead caused by static analysis. Fig. 3 shows the knowledge graph extracted from Memcached, visualized with the tool Gephi [46]. The nodes and edges of different colors in the figure represent different types of entities and relationships.

Fig. 3. The knowledge graph visualization of Memcached project

Because the relationships and the names of entities may differ between versions, alignment and fusion are required. Each generated knowledge graph contains many separate subgraphs with low connectivity that are not connected to the main knowledge graph of the project, so we need to clean up the knowledge graph. After that, we separately extract knowledge from multiple versions of the three open source projects, and fuse the knowledge graphs of different versions of the same project to obtain the final open source project knowledge graph.

B. Learning Path Experiments

Based on the algorithm and the open source project knowledge graph constructed in Section III, this paper validates the learning path recommendation algorithm on three open source projects: Memcached, Lua, and TensorFlow. The verification mainly includes two parts: learning entrance analysis and learning path recommendation.

First, we verify the learning entrance analysis part, and analyze the project learning entrance according to the project knowledge graph. After that, we perform a semantic analysis of the learning entrances obtained, and check whether each entrance is a real program entrance available for learning. Table IV shows the results of the learning entrance analysis experiment. Time costs are measured on a MacBook Pro with a quad-core 2.5GHz Intel Core i7 CPU and 16GB of RAM.

Next, we verify the learning path recommendation algorithm. We select 10 different functions from three open source projects as target functions to generate learning paths, and conduct learning path recommendation experiments separately. We then check all of the recommended learning paths manually to judge whether each recommended learning path is a proper path. The criterion is whether all of the function nodes on the learning path are highly related to the main functionality of the project. We set this criterion because highly related functions can help developers learn the logic of the project more quickly. Table V shows the experiment results of the recommended learning paths.

From the experiment results, we can see that the method in this paper can keep recommending proper learning paths for different open source projects based on the learning entrance recommended. Since the method uses the embedding vector similarity between various kinds of entities as the distance for path recommendation, there may be a very few function call links that are more similar in other dimensions to be
recommended first, which results in the best semantic learning path not being recommended. Here, this paper also classifies this situation as inaccurate.

C. Comparative Experiments

TABLE VI. SURVEY QUESTIONS
Question ID   Question content
Q1            I am familiar with C++
Q2            I can use IDE and I am familiar with IDE tools
Q3            I am familiar with using GitHub
Q4            I usually use the open source community GitHub
Q5            I have learned open source projects to improve my programming skill
Q6            I have contributed to open source projects
Q7            I would like to learn open source projects if possible
Q8            I would like to contribute to open source projects if possible
Q9            I will read blogs written by others while learning a new skill
Q10           When I read blogs, the program version corresponding to the information is often inconsistent with the one I want to use
Q11           When using open source projects, I sometimes choose not to use the latest version
Q12           The update of learning materials is usually behind the version update, which makes me use the new version troublesome
Q13           The inconsistency between the version of the program in the learning materials and the version I actually use caused trouble to my study
Q14           In some cases, the old version of the information is still very helpful to my learning

Finally, this paper conducts a comparative experiment on real developers. The experiment contains two parts: a background survey of real developers and a learning-path-finding comparative experiment.

1) Background survey: We first conduct a survey among developers from different development backgrounds. We find 30 developers with different technical backgrounds and let them complete the different learning tasks we set up, so as to compare the actual learning process of the developers with the results of the method in this paper. Among these developers, practitioners account for 30%, graduate students account for 43.33%, and undergraduate students account for 26.67%. Several questions are asked to survey the developers’ backgrounds in development, their cognition of open source project contribution, and their opinions about learning a multi-version open source project. Table VI shows the survey questions we ask these developers.

We first ask the developers about their backgrounds in development and their cognition of open source project contribution. Fig. 4 shows the statistics of their answers to these questions. All developers are proficient in using IDE tools, and at least understand or are even familiar with the use of the open source community GitHub. At the same time, more than 56.66% of them have participated in the contribution of open source projects, and 90% of them have improved their programming abilities by learning the source code of open source projects. At the same time, 100% of them are willing to learn and participate in the contribution of open source projects if possible, which reflects the fact that open source projects have value not only as open source programs but also as an important learning resource. They are widely used to help developers learn and improve their programming skills.

Fig. 4. Backgrounds of the developers and their cognition of open source project contribution

Among these developers, 10% want to learn open source projects to improve their programming skills but have never completed learning. In response to this situation, we summarize the biggest problems faced by those who start participating in and contributing to an open source project or participate in an existing project. Among them, high-frequency answers such as "I feel that early contributions to an open source project require a deep understanding of the open source project, which is usually very time-consuming and labor-intensive.", "If you want to contribute code, you must first be familiar with the entire project's system, that is, architecture and code structure, if the project is large enough, this process will be very time-consuming", and "missing documentation, incorrect comments" reflect the lack of project documents for developers facing unfamiliar open source projects. It is difficult for the developers to get started.

Fig. 5. The influence of version inconsistency on developers

Besides, we also ask the developers a few questions to verify the influence of the problem that open source projects update too fast and that learning materials are inconsistent with the latest version of the open source project. Fig. 5 shows the result of the survey. Among these developers, 93.33% learn new technologies by reading blogs written by others, 90% face the problem that the program version corresponding to the learning materials is inconsistent with the one they want to use, and 86.67% are affected by this problem. At the same time, all developers believe that the old version of the learning materials can still help them learn and understand the latest
version of the project. It can be seen that linking different versions of knowledge is very helpful for developers' learning, as it can greatly expand the range of learning materials the developers can use. The above result shows that our practice of fusing different versions of knowledge for open source projects is significant.

2) Learning-path-finding comparative experiment: After the survey, through two sets of experiments on these developers, we conduct a comparative experiment between the learning-path-finding task without the multi-version problem and the learning-path-finding task with the multi-version problem. The first purpose of this comparative experiment is to verify whether our recommendation algorithm can help developers understand the project code faster and better than the developers’ personal learning methods, namely how much time developers can save. The second purpose is to verify whether our method can help developers link different versions of knowledge in a multi-version environment, thereby improving the accuracy of learning the correct target.

The first set of experiments asks the developers to get familiar with a function named “item_lock” in the project Memcached and records their time costs. They are asked not only to find the function, but also to find a function call path from the outermost layer of the project to the target function, which is considered to be the learning path. The second set of experiments asks them to do the same thing for a function named “try_read_command”. However, the second target function is not involved in the latest version of Memcached. We can therefore measure the developers’ learning efforts while facing a multi-version environment. During the experiments, these developers can use any tools they are familiar with, including search engines, IDEs, and related documents. None of the developers had heard of Memcached before. Table VII and Table VIII show the results of the two sets of experiments. We calculate the accuracy of the learning paths by the same criterion described in Section IV-B. From the results, our method greatly reduces the time developers spend on reading and studying the project code (on average 14.79s versus 478.48s without multi-version problems, and 15.21s versus 545.73s with them), and when facing cross-version knowledge links, it improves the accuracy with which developers learn the correct knowledge (80% versus 43.3%).

TABLE VII. EXPERIMENT RESULTS WITHOUT MULTI-VERSION PROBLEMS
Method         Our Method   Real Developer
Average time   14.79s       478.48s
Accuracy       80%          70%

TABLE VIII. EXPERIMENT RESULTS WITH MULTI-VERSION PROBLEMS
Method         Our Method   Real Developer
Average time   15.21s       545.73s
Accuracy       80%          43.3%

Finally, after the above comparative experiments, we ask these developers about their acceptance of the concept of learning path we propose. Fig. 6 shows the result of the acceptance of these developers. 80% of them believe that the learning path we propose is a good way to learn open source projects. The experiment results show the overall feasibility and the effectiveness of the method we propose.

Fig. 6. The acceptance of the concept of learning path

V. CONCLUSION AND FUTURE WORK

This paper puts forward a method that automatically recommends learning paths of open source projects for developers using deep learning on knowledge graphs. It uses multiple data sources in an open source community to build a knowledge graph of an open source project. Based on a knowledge graph embedding model and a learning path recommendation algorithm, the method recommends suitable learning paths for developers. We select three well-known open source projects, Lua, Memcached, and TensorFlow, as cases to verify our method. The experiments verify that our method can greatly reduce the time for developers to learn the project code. The experiments also verify that the method can help developers better learn about open source projects. The method makes it easier for developers to contribute to open source projects and enhances the vitality of open source communities. In the future, we can further improve knowledge graphs by extracting more entities and relationships from the project code.

ACKNOWLEDGMENT

This effort is sponsored by the Beijing Outstanding Young Scientist Program under the grant number BJJWZYJH01201910001004 and the National Natural Science Foundation of China under Grant No. 61421091.

REFERENCES

[1] GitHub, May 2020, [online] Available: https://www.github.com
[2] Anderson, B., Quist, D., Neil, J., Storlie, C., & Lane, T. (2011). Graph-based malware detection using dynamic analysis. Journal in computer Virology, 7(4), 247-258.
[3] Cohen, A., & Nissim, N. (2018). Trusted detection of ransomware in a private cloud using machine learning methods leveraging meta-features from volatile memory. Expert Systems with Applications, 102, 158-178.
[4] Thomé, J., Shar, L. K., Bianculli, D., & Briand, L. (2018). An integrated approach for effective injection vulnerability analysis of web applications through security slicing and hybrid constraint solving. IEEE Transactions on Software Engineering.
[5] Habib A, Pradel M. Is this class thread-safe? inferring documentation using graph-based learning[C]//Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 2018: 41-52.
[6] J. Tu, X. Xie, Y. Zhou, B. Xu, & L. Chen. 2016. A Search Based Context-Aware Approach for Understanding and Localizing the Fault via Weighted Call Graph. In 2016 Third International Conference on Trustworthy Systems and their Applications (TSA), IEEE, 64-72.
[7] Ali, K., Lai, X., Luo, Z., Lhoták, O., Dolby, J., & Tip, F. (2019). A Study of Call Graph Construction for JVM-Hosted Languages. IEEE Transactions on Software Engineering.
[8] Gharibi G, Tripathi R, Lee Y. Code2graph: automatic generation of static call graphs for python source code[C]//Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 2018: 880-883.
[9] H. Gascon, F. Yamaguchi, D. Arp, & K. Rieck. 2013. Structural detection of android malware using embedded call graphs. In Proceedings of the 2013 ACM workshop on Artificial intelligence and security, ACM, 45-54.
[10] M. Trapp, M. Rossberg, & G. Schaefer. 2015. Program partitioning based on static call graph analysis for privilege separation. In 2015 IEEE Symposium on Computers and Communication (ISCC), IEEE, 613-618.
[11] Y. Zou, C. Ling, Z. Lin, & B. Xie. 2018. Graph Embedding based Code Search in Software Project. In Proceedings of the Tenth Asia-Pacific Symposium on Internetware, ACM, 1
[12] F. Lv, H. Zhang, J. G. Lou, S. Wang, D. Zhang, & J. Zhao. 2015. Codehow: Effective code search based on api understanding and extended boolean model (e). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE, 260-270.
[13] Lin Z, Zou Y, Zhao J, et al. Improving software text retrieval using conceptual knowledge in source code[C]//2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2017: 123-134.
[14] https://www.jetbrains.com/idea/
[15] Y. Lin, G. Meng, Y. Xue, Z. Xing, J. Sun, X. Peng, ... & J. Dong. 2017. Mining implicit design templates for actionable code reuse. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, IEEE Press, 394-404.
[16] S. Zhou, H. Zhong, & B. Shen. 2018. SLAMPA: Recommending Code Snippets with Statistical Language Model. In 2018 25th Asia-Pacific Software Engineering Conference (APSEC), IEEE, 79-88.
[17] Sun Z, Peng F, Guan J, et al. An approach to helping developers learn open source projects based on machine learning[C]//Proceedings of the 11th Asia-Pacific Symposium on Internetware. 2019: 1-10.
[18] S. Prabhakar, G. Spanakis, & O. Zaiane. 2017. Reciprocal recommender system for learners in massive open online courses (moocs). In International Conference on Web-Based Learning, Springer, Cham, 157-167.
[19] Y. Dai, Y. Asano, & M. Yoshikawa. 2016. Course Content Analysis: An Initiative Step toward Learning Object Recommendation Systems for MOOC Learners. International Educational Data Mining Society.
[20] H. M. Chang, T. M. L. Kuo, S. C. Chen, C. A. Li, Y. W. Huang, Y. C. Cheng, ... & J. W. Tzeng. 2016. Developing a data-driven learning interest recommendation system to promoting self-paced learning on MOOCs. In 2016 IEEE 16th International Conference on Advanced Learning Technologies (ICALT), IEEE, 23-25.
[21] Y. Pang, C. Liao, W. Tan, Y. Wu, & C. Zhou. 2018. Recommendation for MOOC with Learner Neighbors and Learning Series. In International Conference on Web Information Systems Engineering, Springer, Cham, 379-394.
[22] Bollacker K, Evans C, Paritosh P, et al. Freebase: a collaboratively created graph database for structuring human knowledge[C]//Proceedings of the 2008 ACM SIGMOD international conference on Management of data. 2008: 1247-1250.
[23] Lehmann J, Isele R, Jakob M, et al. DBpedia–a large-scale, multilingual knowledge base extracted from Wikipedia[J]. Semantic Web, 2015, 6(2): 167-195.
[24] Suchanek F M, Kasneci G, Weikum G. Yago: a core of semantic knowledge[C]//Proceedings of the 16th international conference on World Wide Web. 2007: 697-706.
[25] Carlson A, Betteridge J, Kisiel B, et al. Toward an architecture for never-ending language learning[C]//Twenty-Fourth AAAI Conference on Artificial Intelligence. 2010.
[26] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, & O. Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, 2787-2795.
[27] Z. Wang, J. Zhang, J. Feng, & Z. Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Twenty-Eighth AAAI conference on artificial intelligence.
[28] Y. Lin, Z. Liu, M. Sun, Y. Liu, & X. Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In Twenty-ninth AAAI conference on artificial intelligence.
[29] G. Ji, K. Liu, S. He, & J. Zhao. 2016. Knowledge graph completion with adaptive sparse transfer matrix. In Thirtieth AAAI Conference on Artificial Intelligence.
[30] B. Perozzi, R. Al-Rfou, & S. Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 701-710.
[31] Dot, May 2020, [online] Available: https://www.graphviz.org/doc/info/lang.html
[32] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, & O. Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, 2787-2795.
[33] Weston J, Bordes A, Yakhnenko O, et al. Connecting language and knowledge bases with embedding models for relation extraction[J]. arXiv preprint arXiv:1307.7973, 2013.
[34] Riedel S, Yao L, McCallum A, et al. Relation extraction with matrix factorization and universal schemas[C]//Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013: 74-84.
[35] Nickel M, Tresp V, Kriegel H P. Factorizing yago: scalable machine learning for linked data[C]//Proceedings of the 21st international conference on World Wide Web. 2012: 271-280.
[36] Nickel M, Tresp V, Kriegel H P. A three-way model for collective learning on multi-relational data[C]//Icml. 2011, 11: 809-816.
[37] Bordes A, Usunier N, Garcia-Duran A, et al. Translating embeddings for modeling multi-relational data[C]//Advances in neural information processing systems. 2013: 2787-2795.
[38] X. Han, S. Cao, X. Lv, Y. Lin, Z. Liu, M. Sun, & J. Li. 2018. Openke: An open toolkit for knowledge embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 139-144.
[39] Nickel M, Rosasco L, Poggio T. Holographic embeddings of knowledge graphs[C]//Thirtieth Aaai conference on artificial intelligence. 2016.
[40] Trouillon T, Welbl J, Riedel S, et al. Complex embeddings for simple link prediction[C]. International Conference on Machine Learning (ICML), 2016.
[41] Wang Q, Mao Z, Wang B, et al. Knowledge graph embedding: A survey of approaches and applications[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(12): 2724-2743.
[42] Blondel V D, Guillaume J L, Lambiotte R, et al. Fast unfolding of communities in large networks[J]. Journal of statistical mechanics: theory and experiment, 2008, 2008(10): P10008.
[43] Lua, May 2020, [online] Available: https://github.com/lua/lua
[44] Memcached, May 2020, [online] Available: https://github.com/memcached/memcached
[45] TensorFlow, May 2020, [online] Available: https://github.com/tensorflow/tensorflow
[46] Gephi, May 2020, [online] Available: https://gephi.org/
