Implementation of Automated Annotation through Mask RCNN Object Detection Model in CVAT using AWS EC2 Instance
All content following this page was uploaded by Marielet Guillermo on 03 January 2021.
Abstract—With machine learning-based innovations becoming a trend, practical resolutions for implementing them on large-scale data and computing problems must be able to keep pace as well. Currently, Graphics Processing Units (GPUs) are being chosen over other available physical devices due to their powerful computing capability and easier handling. Several cloud service providers have also made these accessible online, allowing higher serviceability and lower upfront cost for businesses. With this said, the proponent implements a common machine learning-based application, automated annotation through the Mask RCNN Object Detection Model in CVAT, using an AWS instance. The key purpose is to showcase the viability of deploying data- and compute-intensive systems on the cloud.

Keywords—video processing, convolutional neural network, annotation, CCTV, AWS, computing power, instance, CUDA, GPU, CVAT

I. INTRODUCTION

Hardware requirements for running a machine learning-based system are typically demanding and costly, as these systems are often data- and compute-intensive. In addition, challenges such as obsolescence and poor software support make dedicated hardware an upheaval to maintain [1].

With the advantages GPUs can offer, it has become a trend to utilize them over dedicated hardware. Among their favorable uses are numerous configuration and design opportunities, including memory bandwidth, core frequency, and the number of parallel compute units [2]. Aside from the fact that maintenance is less of a struggle, there are also significant improvements in computing processes under the support of GPU-enabled platforms [3]. NVIDIA in particular accelerated computing tasks with its massively parallel architecture and programming model "CUDA", wherein systems are revised so that compute-intensive kernels are mapped to the GPU while the rest of the system processes remain on the CPU [4].

Concurrently, cloud computing has become wide-ranging. Numerous applications and algorithms that could not be made workable before suddenly became viable once time and power consumption issues were eliminated [5]. For instance, a proposed system for detecting and classifying attacks or intrusions using artificial neural networks was found to deliver outstanding performance when run on Amazon Web Services, one of the leading cloud service providers globally; the novel system is said to be powerful, more accurate, and more precise according to the study. More systems like this, especially applications of Robotics and Artificial Intelligence, could be built to be more portable and battery-efficient if the power behind the cloud were harnessed, because there would be no need for a powerful computer on board: the brain of the system and other computing tasks can be left in the cloud [5][6].

Automated annotation in the Computer Vision Annotation Tool (CVAT), like other machine learning-related applications, is both compute- and data-intensive, especially with convolutional neural networks (CNNs) [7][8]. Although this class of algorithms delivers state-of-the-art performance, it suffers a major problem in terms of complexity and time consumption. Deploying an automated annotation tool on the cloud will lead to higher serviceability and faster annotation; in addition, system administrators will benefit from easier setup and maintenance.

The proponent implements a machine learning-based application on the cloud using an Amazon Web Services instance. This study aims to showcase the viability of deploying data- and compute-intensive systems on the cloud. For this study, automated annotation through the Mask RCNN Object Detection Model in CVAT will be tested.

II. REVIEW OF RELATED LITERATURE

Machine learning has taken such a huge leap forward in innovative solutions that practical resolutions for implementing it on large-scale data problems must increase accordingly [4][9]. GPUs have played a key role in the success of deep learning through a significant reduction in training time [10].

Fig. 1. Graphic Processing Units (GPUs)
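Before relying on CUDA acceleration of the kind described above, it helps to confirm that the GPU is visible to the driver stack. A minimal terminal check, assuming the NVIDIA driver is installed (as on the AWS deep learning images), might be:

```shell
# Confirm the NVIDIA driver can see a CUDA-capable GPU.
nvidia-smi
# Narrow the report to the GPU model and total memory.
nvidia-smi --query-gpu=name,memory.total --format=csv
```

On the K80-backed instance used later in this study, the second command would list the Tesla K80 and its memory; on a machine without NVIDIA drivers both commands fail, which is itself a useful diagnostic.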
Several studies were conducted on systems' performance utilizing GPUs.

One evaluation was directed while changing the number of layers, i.e., 4, 6, 8, and 10, and as presented in its performance results, the GPU-enabled platform delivers about 3 times better performance in terms of speed than the CPU-only setting. Moreover, the GPU-enabled platform shows lower variance among the 50 iterations, which verifies that GPU-enabled computing is more stable [3][11]. While the performance is proven to be exceptional, it comes with higher cost and maintenance.

Cloud services are a fast-moving research ground, leaving a deep impact on many critical research areas, ranging from software to communication and from server platforms to mobile endpoints [12]. Using cloud computing resources with GPU devices in systems needing extensive parallelism, such as deep learning, is worth investing in [13]. Robotics services have begun to be built around cloud computing and service-oriented architecture paradigms [14][15]. Real-time farm monitoring, large speech recognition tasks, and even manual image annotation tasks are being arranged on the cloud, all for the purpose of easier setup and higher availability anytime, anywhere [16][17][18].

An EC2 instance of AWS provides scalable computing capacity, eliminating the need for a business, especially a startup, to invest in hardware devices up front [19][20]. It also comes with machine images which package everything that is needed for a server, including the operating system. Several machine images now also include deep learning frameworks such as Caffe and MXNet. Other settings can be configured as well, such as CPU, memory, and networking capacity.

Annotation is one of the vital parts of innovations involving artificial intelligence. It allows marking of regions of interest in an image or video dataset. This is a simple yet crucial task, as it is where the accuracy of a system's output dominantly relies. The process of annotating video, or even just video frames, is significantly more challenging than the annotation of images: a 5-minute video, for instance, contains between 9,000 and 18,000 frames at an average rate of 30–60 frames per second [8]. Hence, this task becomes a significant roadblock for innovators. Several resolutions were proposed, among which crowdsourcing is said to be of greatest help so far; this approach solves the problem of scarcity in the human workforce available to do quality annotation.

Another effort for generating ground truth information is an interactive video annotation tool. The proposed annotation tool not only generates various ground truth (GT) information such as object, motion, and event information, but also supports a semi-automatic video and image annotation method for fast generation of ground truth. The implementation results show that the proposed annotation tool provides faster and more detailed ground truth information compared to existing methods [21][22].

Fig. 4. Structure of semi-automatic annotation system

In this structure, the processing speed is accelerated by automating the object extraction and analysis process, which requires the most time and manpower when generating ground truth information.

In another study, about security monitoring and autonomous driver assistance systems, a semi-automatic multi-object video annotation based on tracking, prediction, and semantic segmentation was proposed [23]. Below is a high-level view of a video annotation pipeline and a demonstration Graphical User Interface.

Fig. 5. General Video Annotation Framework

Fig. 6. Demonstration Graphical User Interface

The last section in the image above shows the segmented map of the images being processed. This map is automatically generated by the semantic segmentation algorithm. Several models applied to machine vision were at hand, including the above algorithm; nevertheless, Deep Neural Networks (DNNs) delivered an outstanding performance among them [24][25][26].

With the information discussed in this chapter, the proponent deems it appropriate to implement a computer vision [27]
[28] annotation tool with an automated annotation task based on a neural network model on an AWS EC2 instance. This is to serve as a deployment guide for a relevant and timely innovation on the cloud.

III. METHODOLOGY

For a successful demonstration of Automated Video Annotation through the Mask RCNN Object Detection Model on the cloud, three major phases will be executed. Each phase involves subtasks to accomplish the desired output. Refer to the block diagram below:

EC2 Instance Configuration → CVAT Setup → Automated Annotation

Fig. 7. Block Diagram

A. EC2 Instance Configuration

Amazon Machine Image (AMI)

The first step to consider in setting up an instance is the AMI to be used. It is like an ISO image that contains everything that will be written to the instance, such as the operating system, applications, and other additional libraries. From the list of available AMIs, there are deep learning images with the relevant drivers, but Ubuntu Server 18.04 LTS (HVM), SSD Volume Type is enough.

Instance Type

Instances are virtual servers that can run applications. They have varying combinations of CPU, memory, storage, and networking capacity, and give you the flexibility to choose the appropriate mix of resources for your applications. The proponent used an Amazon EC2 P2 instance, as it is designed for general-purpose GPU compute applications using CUDA, making it ideal for the system being deployed. The p2.xlarge instance type was chosen; specifications can be found in the figure below, where the GPU is an NVIDIA K80.

Fig. 8. P2 instance details

Storage

By default, an Elastic Block Store (EBS) volume of general purpose SSD type and 8 GiB size is configured. The size needs to be increased to at least 1000 GiB for seamless running of CVAT on the instance. Alternatively, an additional EBS volume may be attached, or a Simple Storage Service (S3) bucket.

Security Group

Firewall rules must be set to control the traffic for the instance. The following are the minimum requirements for this study:

Fig. 9. Firewall Rules

● SSH – allows specific people to remotely access the instance through their device IP address.
● HTTP/HTTPS – allows internet traffic to reach the instance via web ports.
● ICMP – allows specific devices to check the reachability of an instance for troubleshooting purposes.
● Custom TCP – allows internet traffic to access specific applications via ports within 80–8080. Two of the ports in this range will be assigned to the CVAT admin and user web interfaces.

Once the instance is successfully launched, it appears in the list of available instances, whether its status is running or stopped.

Fig. 10. CVAT instance

Below it are the important instance details, especially the public DNS and private IP. For this study, the URL is ec2-54-152-192-248.compute-1.amazonaws.com. You can remotely connect to your instance with the following command: ssh -i <location of your .pem file keypair> ubuntu@yourpublicDNS.

Example: ssh -i C:/Administrator-key-pair-useast1.pem ubuntu@ec2-54-152-192-248.compute-1.amazonaws.com

B. CVAT Setup

In this phase, Docker installation needs to be done from a terminal window before CVAT can be used. A step-by-step guide can be found here: https://github.com/opencv/cvat/blob/develop/cvat/apps/documentation/installation.md.

Note: The following command needs to be issued prior to using CVAT to ensure that the auto annotation feature will run over the Mask R-CNN model and will utilize both the GPU and CPU of the instance.

Fig. 11. Build docker image and run component code

A file named "docker-compose.override.yml" also needs to be created so that the CVAT instance can be accessed outside of localhost. Here, all other extra settings can be added, but the most important part is to declare the instance public DNS as the CVAT host. If set up successfully, you should be able to access your CVAT through a browser. For optimum experience, Google Chrome is recommended. URL format: yourpublicDNS:portNo./admin for the administration panel and yourpublicDNS:portNo./login for the user panel.
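The override file can be created directly from the SSH session. The sketch below assumes the CVAT layout of that era, in which the proxy service reads the host name from a CVAT_HOST environment variable; the service and variable names may differ between CVAT releases, so treat this as a template rather than the study's exact file:

```shell
# Sketch: expose CVAT beyond localhost by overriding the compose config.
# Assumes the CVAT source tree is checked out in ~/cvat and that the
# service/variable names match the CVAT release in use.
cd ~/cvat
cat > docker-compose.override.yml <<'EOF'
version: "2.3"
services:
  cvat_proxy:
    environment:
      CVAT_HOST: ec2-54-152-192-248.compute-1.amazonaws.com
EOF
# Rebuild and restart so the override takes effect.
docker-compose up -d --build
```

Writing the public DNS here is what lets the admin and user panels be reached from a browser outside the instance.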
C. Automated Annotation

Annotation Task Creation

To be able to test deep learning-based annotation, tasks need to be created first. Depending on the user's access privilege, one can create tasks, perform the annotation task itself, or both. The complete guide to creating an annotation task can be found here: https://github.com/opencv/cvat/blob/develop/cvat/apps/documentation/user_guide.md#creating-an-annotation-task.

Fig. 14. Video No.1 Task 1
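Task creation can also be scripted against CVAT's REST API rather than the web UI. The following is a hypothetical sketch only: the endpoint path, payload shape, port, and credentials follow the CVAT v1 API of that period and are assumptions, not steps taken in the study:

```shell
# Hypothetical sketch: create an annotation task over CVAT's REST API.
# Host, port, and credentials are placeholders for your own deployment.
CVAT_URL="http://yourpublicDNS:8080"
curl -u admin:admin_password -X POST "$CVAT_URL/api/v1/tasks" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "Video No.1 Task 1",
        "labels": [
          {"name": "car"}, {"name": "vehicle"}, {"name": "truck"},
          {"name": "motorcycle"}, {"name": "bus"}, {"name": "bicycle"}
        ]
      }'
```

Scripting creation this way is convenient when, as here, ten near-identical tasks share one label set.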
The proponent created a total of ten (10) tasks, five (5) each for two (2) 1-minute videos. These videos are CCTV footage of vehicles on different roads and from different viewpoints. The following labels were assigned: car, vehicle, truck, motorcycle, bus, and bicycle. This means the model will identify and classify objects only according to the indicated labels.

Fig. 15. Video No.1 Task 2
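The annotation workload behind each task can be estimated from the frame-rate figures cited earlier: frames = duration in seconds × frames per second. For the 1-minute clips used here, the cited 30–60 fps range gives:

```shell
# Frames per 1-minute video at the lower and upper cited frame rates.
duration_s=60
echo $((duration_s * 30))   # 30 fps lower bound: 1800 frames
echo $((duration_s * 60))   # 60 fps upper bound: 3600 frames
```

So each task asks the model to process on the order of a few thousand frames, which is why processing time is the metric compared later.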
Automated Annotation Testing

The steps to perform an automated annotation task can be found here: https://github.com/opencv/cvat/blob/develop/cvat/apps/documentation/user_guide.md#automatic-annotation.

Automatic annotation of tasks runs one at a time. The percentage of completion can be monitored on the task page.

Fig. 16. Video No.2 Task 1

The proponent also set up CVAT on a local machine with the same specifications as the instance, created tasks, and performed automated annotation as above. This is purely to compare the processing time between cloud and local machine-based annotation runs through a deep learning model.

Automated Annotation Output

Here are sample outputs from cloud-based automated annotation:

Fig. 18. Video No.1 Task 1

Fig. 19. Video No.1 Task 2
Fig. 20. Video No.2 Task 1

Fig. 21. Video No.2 Task 2

IV. DATA AND RESULTS

This chapter discusses and illustrates measures of the performance of implementing machine learning in the cloud. A comparison of processing time between cloud and local machine execution is shown. The proponent would like to prove the viability of the neural network implementation for this study, taking into consideration similar if not better performance than on a local machine. With that said, the following were kept consistent between the local device and the instance: K80 GPU device, video datasets, annotation tool (CVAT), deep learning model, labels, and time of implementation.

A. Descriptive Statistics

Video Dataset No. 1

Fig. 22. Cloud vs Local Descriptive Statistics No. 1

Video Dataset No. 2

From the statistics above, it can be seen that the mean and median values of the processing time for the cloud-based implementation are almost fifty percent (50%) lower than those of the local machine. The standard error for the cloud is also slightly lower than for the local machine, as are the standard deviation and variance. These results imply that the cloud implementation of the chosen deep learning model retains and even surpasses the performance of the local-based one in terms of how fast automatic annotation can be completed. This is also evident in the rest of the parameters, including the Sum, or total processing time: 131 minutes (a little over 2 hours) for the 5 tasks on the cloud, compared to 230 minutes (almost 4 hours) locally. Regarding the confidence level, the cloud results have a much lower value than the local-based ones. This may not look good; however, it only suggests that more samples may be needed to gain more confidence that the results reflect the true population parameter.

B. Cloud vs Local Processing Time Chart

Below are the graphs representing the processing time for each task made on the two (2) video datasets.

Fig. 24. Cloud vs Local Processing Time for Video Dataset No. 1

Fig. 25. Cloud vs Local Processing Time for Video Dataset No. 2

The diagrams depict smaller processing time values for all tasks run on the cloud, for both video datasets. Thus, automatic annotation on the cloud completes faster than on the local machine used.
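The summary comparison above can be reproduced with ordinary command-line tools once per-task times are exported. In the sketch below, the five per-task values on each side are invented placeholders chosen only so the column totals match the reported sums (131 minutes cloud, 230 minutes local); they are not the study's measurements:

```shell
# Summarize per-task processing times (minutes) with awk.
# Placeholder values; only the totals match the sums reported in the text.
cloud="25 26 27 26 27"    # sums to 131
loc="45 46 47 46 46"      # sums to 230
for side in "$cloud" "$loc"; do
  echo "$side" | tr ' ' '\n' |
    awk '{ s += $1 } END { printf "sum=%d mean=%.1f\n", s, s/NR }'
done
```

Because the totals are fixed by the text, the per-task means (131/5 ≈ 26.2 versus 230/5 = 46.0 minutes) follow regardless of how the placeholder split is chosen, mirroring the roughly 50% reduction discussed above.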