
Project-based Learning and E-learning

Data streaming architecture based on Apache Kafka and GitHub for tracking students' activity in higher education software development courses

Milan Miloradović
Department for e-business
The University of Belgrade, Faculty of Organizational Sciences
Belgrade, Serbia
[email protected]
https://orcid.org/0000-0003-4101-2350

Ana Milovanović
Department for e-business
The University of Belgrade, Faculty of Organizational Sciences
Belgrade, Serbia
[email protected]
https://orcid.org/0000-0001-5282-881

Abstract — Data streaming architecture can be used to collect data and gain insights into the dynamics of individual or collaborative software development activity that takes place in higher education courses. There is room to further investigate streaming architecture in this context. Code versioning platforms, such as GitHub, serving as data sources in existing implementations of data streaming architecture are lacking in practice. The goal of this paper is to investigate the implementation of a custom data streaming architecture that could be used to track real-time students' analytics in higher education software development courses. The solution is based on the Apache Kafka and GitHub platforms. Also, the architecture developed in the paper could be considered when planning to integrate an LMS (Learning Management System) as a visual web interface for students' analytics.

Keywords — data streaming architecture, Apache Kafka, GitHub, higher education, software development, Learning Management System (LMS)

I. INTRODUCTION

Data streaming architecture is based on the concept of events [1]. Event Streaming Platforms, which are based on the data streaming architecture, provide the infrastructure that enables software to react in real time to the given events [1]. Apache Kafka is an open-source streaming platform that makes use of producers, which are applications that send messages to the Kafka broker [2]. The Kafka broker stores messages that are later accessed by consumers [2].

Git is a system that enables tracking changes to user files, and it is considered a Version Control System (VCS) [3]. The GitHub platform relies on git and its commands to perform version control of user files.

The aim of this paper is to provide a possible solution for a data streaming architecture that would be used to collect and process data on students' activity that takes place on version control platforms in higher education software development courses.

Upon considering different use cases of implementing data streaming architecture and taking good architectural practices into account, a possible architecture is formed. The purpose of this paper is to investigate the integration of a version control platform (the GitHub platform) with data streaming platforms such as Kafka to process events generated while conducting higher education software engineering courses. This use case is lacking in practice, and the results could be beneficial for further related work as far as higher education is concerned.

The solution presents a data streaming architecture based on Apache Kafka and the GitHub platform. Besides, the GitHub webhook concept is described, as well as the flow of communication between the Kafka producer and the Kafka consumer. In the proposed solution, the communication between the Kafka producer and the Kafka consumer starts with a GitHub webhook event.

The Kafka producer and Kafka consumer are implemented using the Java Spring Boot framework. Java Spring Boot is an open-source, microservice-based Java web framework [4]. The microservice architecture provides developers with a fully enclosed application, including embedded application servers [4].

The question of integrating a Learning Management System (LMS) such as Moodle into the architecture for a unified dashboard preview of students' analytics is also considered.

II. LITERATURE REVIEW

As the authors state in [5], the GitHub platform provides insights into social coding activities. Apart from the popular usage of the GitHub platform in the software development industry, this is also a reason to consider using GitHub as a collaborative software development platform [6] in a higher education setup and as a data source in data streaming architecture.

GitHub data analysis done in [7] demonstrates the possibilities of the data generated through events on the GitHub platform. Different analytics are considered, including the number of commits per contributor and SNA (Social Network Analysis) [7]. Those analytics could also be considered when implementing the Kafka consumer.
2022 International conference on E-business technologies (EBT)

Some of the research papers dealing with the integration of code versioning platforms into the curriculum of software development university courses have relied on the GitHub platform and its functionality to gain insight into students' activities [8][9][10]. However, using GitHub as a data source provider for real-time stream processing and analytics in an educational environment is lacking in practice.

New streaming technologies available today handle stream data with high performance, with a message throughput of millions of messages per second [2]. Data platforms handle data from different sources and stream data to different consumers [2].

Events in the Kafka ecosystem are assigned to topics [11]. Those topics hold different numbers of logs (shards or partitions). The number of shards is configurable, and thus scalability in the Kafka ecosystem is provided [11].

The use cases of deploying data streaming architecture are quite diverse in education. In [12], the authors developed a cloud-based e-learning platform to provide educational content for the agricultural community. The streaming analysis there is employed in real time using Apache Kafka and Apache Spark to provide high-quality video content to users and to control server resources.

A popular area to employ streaming platforms is certainly IoT (Internet of Things). In the research presented in [13], the authors developed a course called Network-of-Things Engineering Lab (NoteLab). The course combines IoT, edge and cloud computing, and deals with the implementation of the interfaces and protocols that connect the entire system (MQTT, CoAP and HTTP) [13]. Kafka is used as a connector, consisting of a Kafka broker among other components, using the publish/subscribe protocol to connect with Kafka producers, and a Kafka-Firebase connector to connect to the Firebase real-time database [13].

Machine learning is another popular area to take into consideration when investigating the usage of streaming platforms. In [14], Apache Kafka is utilized to implement a stream processing system based on a publish-subscribe pattern. The system developed in [14] deals with vehicle detection based on attributes such as color, speed and type. There are two main steps: the first is getting the vehicle information from a video, and the second is streaming that information to subscribers.

In [15], a "continuous quality assurance approach" is facilitated by making use of DevOps and cloud computing [15]. The prototype of the solution consists of detecting problems that are likely to happen when new versions of the microservices are being built, deployed, or targeted with requests [15]. The data crawler observes all running microservice containers, collects the relevant data and sends the data to the Apache Kafka cluster. The data is further fetched and indexed in Elasticsearch [15].

Another usage of data streaming is implemented in the CERN HSE (occupational Health & Safety and Environmental protection) Unit [16], which deals with the implementation of the CERN Safety Policy. Researchers developed the REMUS (Radiation and Environmental Unified Supervision) system, which uses the open-source Apache Kafka streaming platform to stream real-time data to their Web Interfaces and Data Visualization Tools [16].

III. METHODOLOGY

In order to create a custom data streaming architecture based on Apache Kafka and GitHub for tracking students' activity, a GitHub webhook, a Kafka producer and a Kafka consumer are used.

It is necessary to identify the components of the data streaming architecture. The identified components are as follows (Table 1):

■ GitHub users
■ GitHub webhook
■ Kafka producer
■ Kafka cluster
■ Kafka consumer

The Kafka producer and consumer are implemented using the Java Spring Boot framework.

Table 1. Identified Components

No.  Components       Example
1.   GitHub user      Student working on a GitHub repository (push event).
2.   GitHub webhook   Mechanism integrated into the GitHub organization of the higher education institution (elab).
3.   Kafka producer   The application that broadcasts messages.
4.   Kafka cluster    Its task is to process and organize the data.
5.   Kafka consumer   The application that receives messages.

The first part of the architecture is the GitHub platform. As already stated, it is a platform for version control and is usually used for various social coding [5] activities. In this particular case, the GitHub platform serves to broadcast events that are created within the GitHub organization. Some of the event types that should be considered are: push, pull request, merge, view, commit, etc.
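The interaction between the identified components can be sketched in a simplified, framework-free form. The following is an illustrative model only, not the paper's actual implementation (which uses Spring Boot and a real Kafka cluster); all class names, topic names and record formats here are assumptions. It mirrors the flow GitHub user → webhook payload → producer → topic log → consumer, including the broker behavior of assigning an offset to each appended message.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory stand-in for the Kafka cluster:
// each topic is modeled as a single append-only log (one partition).
public class StreamingSketch {

    static class MiniCluster {
        private final Map<String, List<String>> topics = new HashMap<>();

        // Broker behavior: append the message to the topic log and return its offset.
        long append(String topic, String message) {
            List<String> log = topics.computeIfAbsent(topic, t -> new ArrayList<>());
            log.add(message);
            return log.size() - 1;
        }

        // Consumer behavior: read all records starting from a given offset.
        List<String> read(String topic, int fromOffset) {
            List<String> log = topics.getOrDefault(topic, List.of());
            return log.subList(fromOffset, log.size());
        }
    }

    public static void main(String[] args) {
        MiniCluster cluster = new MiniCluster();

        // Producer side: webhook-delivered push events are forwarded to a topic.
        long first = cluster.append("event_topic", "push:student1:elab/example-repo");
        cluster.append("event_topic", "push:student2:elab/example-repo");

        // Consumer side: read everything from offset 0.
        List<String> consumed = cluster.read("event_topic", 0);
        System.out.println("first offset = " + first);
        System.out.println("records consumed = " + consumed.size());
    }
}
```

The design point the sketch illustrates is that the producer and consumer never talk to each other directly: both interact only with the topic log, which is what allows the consumer-side analytics to be developed and scaled independently of the webhook receiver.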

GitHub users, in this particular example, are students within the GitHub organization who generate events over their repositories. When one of these events happens, the GitHub webhook [17] is triggered, as shown in Fig. 1.

After that, the GitHub webhook forwards a POST request to the Kafka producer, which further determines which topic to send the event to.

The second part of the architecture is Apache Kafka (Fig. 1). Apache Kafka consists of a Kafka cluster and a Zookeeper server, as well as the producer and consumer. The Kafka producer and Kafka consumer are applications where one application broadcasts messages (the producer) and the other receives messages from the broadcaster (the consumer) [18].

Fig. 1. Proposed data streaming architecture diagram

An example (Table 1) of an event triggered by a GitHub webhook is shown in the figure (Fig. 2).

The Kafka cluster's task is to process and organize the data that is passed to it. Each Kafka cluster contains a list of topics, to which incoming messages are redirected. A topic is made up of one or more partitions, and each partition is an ordered, immutable sequence to which new records are constantly appended [18].

Based on the GitHub event type being processed by the Kafka producer, messages can be passed to different topics. For example, for each commit event being processed, the Kafka producer sends a message to the topic Topic 1 (Fig. 1). For each pull event, the Kafka producer can be configured to send a message to the topic Topic 2 (Fig. 1), and for each merge event, to the topic Topic 3 (Fig. 1).
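The routing described above reduces to a mapping from event type to topic name. The following is only a sketch under assumed names (the topic strings below stand in for Topic 1/2/3 from Fig. 1, and a real producer would publish the message via the Kafka client or Spring's KafkaTemplate rather than just return the topic name):

```java
import java.util.Map;

// Hypothetical sketch: choose a Kafka topic based on the GitHub event type,
// following the Topic 1 / Topic 2 / Topic 3 assignment described in the text.
public class TopicRouter {

    private static final Map<String, String> ROUTES = Map.of(
            "commit", "topic-1",
            "pull", "topic-2",
            "merge", "topic-3");

    // Event types without a dedicated topic fall back to a catch-all topic.
    public static String topicFor(String eventType) {
        return ROUTES.getOrDefault(eventType, "topic-other");
    }

    public static void main(String[] args) {
        System.out.println(topicFor("commit")); // topic-1
        System.out.println(topicFor("fork"));   // topic-other
    }
}
```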
The identified topics are also given in the table (Table 2):

Table 2. Identified Topics

No.  Topics   Example
1.   Topic 1  Topic which contains only commit events.
2.   Topic 2  Topic which contains only pull events.
3.   Topic 3  Topic which contains only merge events.

Fig. 2. Push event

The push event (Fig. 2) contains data about the event id, type, actor (in this case, the student who generated the event), repository (repo), GitHub organization (org), and the date the event was created (created_at).

The Kafka producer's task is to forward push events to a certain topic (event_topic), as shown in Fig. 3. Once the data arrives at the topic, the Kafka consumer reads the data. It is possible for the consumer application, upon reading the data, to perform analytics and present the data through a web interface. The web interface could be integrated with Moodle LMS, but this option needs to be investigated further.
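For illustration, an abridged payload of the kind shown in Fig. 2 might look as follows. Only the field names (id, type, actor, repo, org, created_at) come from the event described in the text; all values and the nested structure are invented placeholders, and the exact shape depends on GitHub's event format:

```json
{
  "id": "12345678901",
  "type": "PushEvent",
  "actor": { "login": "student1" },
  "repo": { "name": "elab/example-repo" },
  "org": { "login": "elab" },
  "created_at": "2022-03-15T10:30:00Z"
}
```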
A Kafka broker is a single server instance within the Kafka cluster. It receives messages from the producer, assigns an offset to each message, and then stores those messages on disk [18].

Zookeeper is a coordination and management service used by the Kafka cluster [18].

In the end, the Kafka consumer application reads data from the topic, and students' activity analytics are calculated. Later, these analytics could be presented in a web interface, such as a dashboard preview.

Fig. 3. Kafka producer implementation
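As a sketch of the consumer-side analytics mentioned above, the number of commits per contributor (one of the metrics suggested in [7]) could be computed from consumed events roughly as follows. This is illustrative only: a real consumer would read the records from the Kafka topic, and the plain list of actor names used here is an assumed stand-in for the deserialized event payloads.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative consumer-side analytics: count events per student.
public class CommitAnalytics {

    // Given the "actor" (student) of each consumed event, tally events per student.
    public static Map<String, Integer> commitsPerStudent(List<String> actors) {
        Map<String, Integer> counts = new HashMap<>();
        for (String actor : actors) {
            counts.merge(actor, 1, Integer::sum); // increment this actor's count
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> actors = List.of("student1", "student2", "student1");
        Map<String, Integer> counts = commitsPerStudent(actors);
        System.out.println("student1 -> " + counts.get("student1")); // student1 -> 2
    }
}
```

A dashboard (for example, one embedded in Moodle LMS, as the paper proposes to investigate) would then render such per-student tallies rather than the raw event stream.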
