Unsupervised Knowledge Transfer For Object Detection in Marine Environmental Monitoring and Exploration
Digital Object Identifier 10.1109/ACCESS.2020.3014441
ABSTRACT The volume of digital image data collected in the field of marine environmental monitoring and
exploration has been growing at a rapidly increasing rate in recent years. Computational support is essential
for the timely evaluation of the high volume of marine imaging data, but often modern techniques such as
deep learning cannot be applied due to the lack of training data. In this article, we present Unsupervised
Knowledge Transfer (UnKnoT), a new method to use the limited amount of training data more efficiently.
In order to avoid time-consuming annotation, it employs a technique we call ‘‘scale transfer’’ and enhanced
data augmentation to reuse existing training data for object detection of the same object classes in new image
datasets. We introduce four fully annotated marine image datasets acquired in the same geographical area
but with different gear and distance to the sea floor. We evaluate the new method on the four datasets and
show that it can greatly improve the object detection performance in the relevant cases compared to object
detection without knowledge transfer. We conclude with a recommendation for an image acquisition and
annotation scheme that ensures a good applicability of modern machine learning methods in the field of
marine environmental monitoring and exploration.
INDEX TERMS Object detection, knowledge transfer, deep learning, marine environmental monitoring,
image annotation.
Considering the lack of annotated training data and the high cost of time-consuming manual image annotation, it is desirable to have a computer vision system for automated or assisted image annotation that does not require extensive retraining for each new dataset. Such a computer vision system must be able to adapt to the changes between datasets as described above, where OOI of the same class may differ in their visual appearance. Assisted by such a system, marine scientists would only have to annotate one dataset, and the time required for object detection, which is the most time-consuming part of manual image annotation [6], would be greatly reduced for the remaining datasets of the same geographical area. In this way, the knowledge consisting of images and annotations that were collected previously can be transferred and is not lost.

Knowledge transfer in the context of marine environmental monitoring and exploration has previously been presented by Skaldebø et al. [29], who attempt to transfer the knowledge obtained in a simulated underwater environment to the real environment. First, artificial 3D-rendered images are created, showing scenes similar to the real images. Then CycleGAN [30] is used to make the artificial images look more realistic. Walker et al. [31] use physics-based color correction and scale normalization on underwater images to reduce the generalization error of a DeepLabV3+ model [32] for image segmentation. Similarly, Yamada et al. [33] use color correction and image rescaling to enhance their method for unsupervised feature learning of georeferenced sea floor images. All of these methods are applied to a single dataset and are not used for knowledge transfer to enable cross-dataset machine learning.

In this article we present Unsupervised Knowledge Transfer (UnKnoT), a new method for object detection in marine environmental monitoring and exploration. The method employs a technique we call "scale transfer" and enhanced data augmentation to adapt one image dataset to the visual properties of another image dataset and to reuse existing image annotations for object detection. To the best of our knowledge, UnKnoT is the first method that addresses the reuse of existing image annotations for cross-dataset machine learning in marine environmental monitoring and exploration. To evaluate the method, we introduce four fully annotated marine image datasets collected in the same geographical area but with varying gear and distance to the sea floor. Our experiments show that UnKnoT can greatly improve the object detection performance in the relevant cases compared to object detection without knowledge transfer. In combination with the existing MAIA method, UnKnoT can be used instead of novelty detection in Stage I of MAIA to generate more accurate suggestions for OOI if the images of the datasets show the same classes of OOI. Taking this into account, we conclude with a recommendation for an image acquisition and annotation scheme that ensures a good applicability of modern machine learning methods in the field of marine environmental monitoring and exploration.

In the following section, the UnKnoT method is presented in detail, describing the individual steps for scale transfer, data augmentation and object detection (see Section II). Code has been made available with this publication and can be accessed on GitHub (https://github.com/BiodataMiningGroup/unknot). The experimental setup that was used to evaluate the method is presented in Section III, including the four datasets, referred to as S083, S155, S171 and S233. The datasets have been made available with this publication [34]-[37] and can be visually explored in BIIGLE 2.0 (https://biigle.de/projects/237, login: [email protected], password: UnKnoTpaper). The experimental results are summarized in Section IV and discussed in Section V. The manuscript ends with a conclusion about the relevance of our results and the UnKnoT method for marine image annotation.

II. METHODS
In the UnKnoT approach, knowledge is represented by a source dataset D^s and annotations that were manually created by domain experts. The knowledge is transferred by transforming D^s and the annotations so that a deep learning model can be trained to perform object detection on a target dataset D^t which has not been annotated. The entire process consists of three consecutive steps which are described in detail in the following sections (see also Fig. 2). In the first step, scale transfer is applied to the images I^s of the annotated source dataset D^s. This transforms the visible OOI in I^s to a scale similar to that of the OOI in the images of the target dataset D^t. A set of annotation patches A^{s→t} is extracted from the scale-transferred images, where each annotation patch is a cropped image centered on an annotated OOI. In the second step, enhanced data augmentation is applied to increase the size and variety of the set of annotation patches A^{s→t}, resulting in the set of augmented annotation patches A^{s⇒t}. In the final step, the set A^{s⇒t} is used to train a Mask R-CNN model [38] which is subsequently applied for object detection on the target dataset D^t.

A. SCALE TRANSFER
On most deployments of an AUV or OFOS, the observation platform moves at a fixed distance to the sea floor. This ensures an almost stable scale and illumination of OOI in the images that are captured during the same deployment. The distance to the sea floor may vary between two deployments, though. An OFOS is usually operated much closer to the sea floor than an AUV, and even the same observation platform can be operated at different target distances on different deployments. This can result in highly different scales for the same classes of OOI in different image datasets (see Fig. 4). Fully convolutional neural networks for instance segmentation or object detection such as Mask R-CNN [38] are usually scale-invariant because they are trained on large image datasets in which many scales of objects of the same class occur. In this context, however, the scales of OOI of the same class and dataset have a very low variance owing to the fixed distance of the observation platform to the sea floor. In addition, the datasets usually have a much lower total number of annotations than in other scenarios.
FIGURE 2. The UnKnoT method. (1) Scale transfer: Images from an annotated source dataset D^s are transformed to the set of scale-transferred images I^{s→t} (a) and the set of annotation patches A^{s→t} is extracted (b). (2) Data augmentation: The size and variety of the annotation patches A^{s→t} is increased through data augmentation, resulting in the set of augmented annotation patches A^{s⇒t} (c). (3) Object detection: A Mask R-CNN model is trained on A^{s⇒t} and applied to the images of D^t to produce the final object detections (d).
To mitigate the scale shift between different datasets, scale transfer transforms the images I^s of an annotated source dataset D^s to make the OOI appear at a scale similar to that of the OOI in the images I^t of the target dataset D^t. The source dataset D^s = {(I_i^s, d_i^s)} and the target dataset D^t = {(I_i^t, d_i^t)} consist of tuples of an image I_i and the distance d_i of the observation platform to the sea floor when the image was captured. The average distance to the sea floor of the target dataset is denoted as \bar{d}^t:

\bar{d}^t = \frac{1}{|I^t|} \sum_{i=1}^{|I^t|} d_i^t \qquad (1)

Each image I_i^s ∈ I^s has a width of w_i and a height of h_i pixels. To apply scale transfer to an image I_i^s, the scale transfer factor d_i^{s→t} is calculated first as defined in (2). Next, each image I_i^s is scaled to the width w_i^{s→t} and height h_i^{s→t} as defined in (3) and (4), respectively, resulting in the set of scale-transferred images I^{s→t} (see Fig. 2a). A three-lobe Lanczos kernel is applied for downscaling (i.e. d_i^{s→t} < 1) and a cubic filter is applied for upscaling (i.e. d_i^{s→t} > 1), which are the recommended methods of the VIPS image processing library [39].

d_i^{s→t} = \frac{d_i^s}{\bar{d}^t} \qquad (2)

w_i^{s→t} = w_i \cdot d_i^{s→t} \qquad (3)

h_i^{s→t} = h_i \cdot d_i^{s→t} \qquad (4)

From each image in I^{s→t} the annotated OOI are extracted as 512 × 512 pixel crops which form the set of annotation patches A^{s→t} (see Fig. 2b). The annotation patches are passed to the next step for data augmentation.
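For illustration, the scale transfer step can be sketched in a few lines of Python. The listing below is a minimal sketch and not the reference implementation from the UnKnoT repository: Pillow is used as a stand-in for the VIPS library mentioned above (with Pillow's LANCZOS and BICUBIC filters in place of the three-lobe Lanczos and cubic kernels), and the file name, camera distances and annotation coordinates are example values.

```python
# Minimal sketch of scale transfer (Eqs. 1-4) and patch extraction.
from PIL import Image


def scale_transfer_factor(d_source, d_target_mean):
    # Eq. (2): d_i^{s->t} = d_i^s / mean(d^t)
    return d_source / d_target_mean


def apply_scale_transfer(image, factor):
    # Eqs. (3) and (4): rescale width and height by the scale transfer factor,
    # Lanczos for downscaling and a cubic filter for upscaling.
    w, h = image.size
    new_size = (max(1, round(w * factor)), max(1, round(h * factor)))
    resample = Image.LANCZOS if factor < 1 else Image.BICUBIC
    return image.resize(new_size, resample=resample)


def extract_patch(image, cx, cy, size=512):
    # Cut a size x size crop centered on a (rescaled) annotation position.
    half = size // 2
    return image.crop((cx - half, cy - half, cx + half, cy + half))


# Example values: source image captured at 1.7 m, target dataset with an
# average camera distance of 3.4 m (hypothetical numbers for illustration).
factor = scale_transfer_factor(d_source=1.7, d_target_mean=3.4)
image = Image.open("source_image.jpg")
scaled = apply_scale_transfer(image, factor)
# The annotation center must be rescaled by the same factor before cropping.
patch = extract_patch(scaled, cx=round(1520 * factor), cy=round(830 * factor))
```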
B. DATA AUGMENTATION
Data augmentation is often used to increase the size and variety of the data that is available to train a machine learning model. This can often improve the performance of the trained model [40], [41]. In the context of computer vision tasks such as object detection or classification, common augmentation methods include operations such as horizontal or vertical flipping, rotation or blurring of images. Viable augmentation operations highly depend on the visual domain of the image datasets (e.g. vertical flipping makes sense for the image of a football but not for a face).

In case of images of the sea floor captured with an AUV or OFOS, augmentation operations such as flipping, rotation or blurring can be applied. The OOI in the images are mostly living organisms with a symmetric shape, which makes the flipping operations viable. In addition, the OOI in the images are photographed from the top, so they can occur at any rotation angle. Different camera properties, motion of the observation platform or optical distortion by the water column can introduce varying degrees of blur. An object detection model that was trained partially with blurred images through data augmentation can be more robust in these cases.

In case of UnKnoT, the machine learning model is trained with images of one dataset and applied to images of another dataset. The images can be captured with different observation platforms and different cameras, and are usually available as JPEG files. Different camera and storage settings can produce JPEG files with a varying degree of compression, which can introduce characteristic compression artifacts in the images. We propose to use artificial JPEG compression as an augmentation operation to make an object detection model more robust for the application on different datasets.
In UnKnoT, data augmentation is applied to the annotation patches A^{s→t} at each step during training of the Mask R-CNN model (see the following section). For each step, a random selection of zero to all of the following augmentation operations is applied: horizontal flipping, vertical flipping, rotation by 90, 180 or 270 degrees, Gaussian blur with a random standard deviation σ ∈ [1.0, 2.0] and artificial JPEG compression with a random compression factor c ∈ [25, 50]. The set of augmented annotation patches is denoted as A^{s⇒t} (see Fig. 2c).
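The following listing sketches this augmentation policy. It is an illustration rather than the implementation used in the experiments; in particular, the mapping of the compression factor c to Pillow's JPEG quality setting is an assumption, and in Mask R-CNN training the geometric operations would also have to be applied to the corresponding instance masks.

```python
# Sketch of the enhanced augmentation: a random subset of flips, 90-degree
# rotations, Gaussian blur and artificial JPEG compression per training step.
import io
import random

from PIL import Image, ImageFilter


def jpeg_compress(patch, quality):
    # Re-encode the patch as JPEG in memory to introduce compression artifacts.
    buffer = io.BytesIO()
    patch.convert("RGB").save(buffer, format="JPEG", quality=quality)
    out = Image.open(buffer)
    out.load()
    return out


def augment_patch(patch):
    ops = []
    if random.random() < 0.5:                                       # horizontal flip
        ops.append(lambda im: im.transpose(Image.FLIP_LEFT_RIGHT))
    if random.random() < 0.5:                                        # vertical flip
        ops.append(lambda im: im.transpose(Image.FLIP_TOP_BOTTOM))
    if random.random() < 0.5:                                        # rotation
        angle = random.choice([90, 180, 270])
        ops.append(lambda im: im.rotate(angle))
    if random.random() < 0.5:                                        # Gaussian blur
        sigma = random.uniform(1.0, 2.0)
        ops.append(lambda im: im.filter(ImageFilter.GaussianBlur(sigma)))
    if random.random() < 0.5:                                        # JPEG compression
        c = random.randint(25, 50)
        # Assumed mapping of the compression factor c to a JPEG quality value.
        ops.append(lambda im: jpeg_compress(im, quality=100 - c))
    for op in ops:
        patch = op(patch)
    return patch
```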
C. OBJECT DETECTION
Object detection is performed in a similar way as in Stage III of the MAIA method [14], which has been shown to be effective in this particular context, with a few differences that are described in the following. In Stage III of MAIA, a Mask R-CNN model [42] is trained on an augmented set of training samples using pre-trained weights of the COCO dataset [16]. The trained model is applied to an image collection for the segmentation of "interesting" pixel regions in the images, which are subsequently converted into circle annotations. In UnKnoT, the Mask R-CNN model is trained using the set A^{s⇒t} of augmented annotation patches, as well as the pre-trained weights of the COCO dataset. The data augmentation used in Stage III of MAIA is replaced by the enhanced data augmentation described in the previous section. In contrast to the training configuration of MAIA and [42], a value of 0.85 is used for the RPN_NMS_THRESHOLD, which increases the number of region proposals during training. In this context, a higher number of region proposals during training is beneficial for the detection of very small objects in the presence of very large and salient objects in the same image. In addition, a stepped learning rate decay is used to improve the convergence of the object detection performance across experiment replicates. For the stepped learning rate decay, the heads layers are trained for 10 epochs each with a learning rate of 10^{-3}, 5·10^{-4} and 10^{-4}, and all layers for another 10 epochs each with a learning rate of 10^{-4}, 5·10^{-5} and 10^{-5}, resulting in a total of 60 training epochs compared to the 30 epochs of the training configuration of MAIA. One epoch consists of 400 steps and in each step, a batch of five images is processed. Training took about five hours per dataset on a single NVIDIA Tesla V100. Inference is performed on the images I^t of the target dataset in the same way as in Stage III of MAIA [14] (see Fig. 3). The final result is a set of circle annotations enclosing potential OOI in I^t (see Fig. 2d).
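A condensed sketch of this training configuration, based on the matterport Mask R-CNN implementation [42], is shown below. The dataset objects, the weights file name and the assumption of a single "interesting" foreground class (as in Stage III of MAIA) are illustrative and may differ from the exact UnKnoT code.

```python
# Sketch of the Mask R-CNN training configuration described above.
from mrcnn.config import Config
from mrcnn import model as modellib


class UnKnoTConfig(Config):
    NAME = "unknot"
    NUM_CLASSES = 1 + 1            # background + one "interesting" class (assumed)
    IMAGES_PER_GPU = 5             # batch of five images per step
    STEPS_PER_EPOCH = 400          # 400 steps per epoch
    RPN_NMS_THRESHOLD = 0.85       # more region proposals during training


# train_set and val_set are mrcnn.utils.Dataset objects prepared from the
# augmented annotation patches A^{s=>t} (preparation omitted here).
config = UnKnoTConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs")
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# Stepped learning rate decay: matterport's train() interprets "epochs" as the
# absolute epoch to train up to, so this schedule yields 60 epochs in total.
schedule = [("heads", 1e-3, 10), ("heads", 5e-4, 20), ("heads", 1e-4, 30),
            ("all", 1e-4, 40), ("all", 5e-5, 50), ("all", 1e-5, 60)]
for layers, lr, until_epoch in schedule:
    model.train(train_set, val_set, learning_rate=lr,
                epochs=until_epoch, layers=layers)
```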
III. EXPERIMENTAL SETUP
Four fully annotated image datasets were created to evaluate the UnKnoT method. The datasets were captured in the same geographical area, showing the same classes of OOI, but with different observation platforms and distances to the sea floor. In addition, a new metric to measure the effectiveness of UnKnoT was created, which accounts for the desired properties of an object detection method for marine image annotation. The method was tested in comprehensive experiments on different combinations of datasets to evaluate the effectiveness of scale transfer and enhanced data augmentation for unsupervised knowledge transfer.

A. DATASETS
The four image datasets used to evaluate UnKnoT are referred to as S083, S155, S171 and S233. Each dataset consists of 550 randomly selected images from the image collections [22] (S083), [23] (S155), [24] (S171) and [25] (S233). The image collections were acquired during the 2015 cruises SO242/1 and SO242/2 of research vessel SONNE at the Peru Basin Disturbance and Colonization (DISCOL) area [43]. The images of the different datasets were captured using different observation platforms (OFOS and AUV) as well as different average distances to the sea floor (see Table 1).

TABLE 1. Properties of the four datasets that were used to evaluate UnKnoT with the observation platform, average distance and standard deviation of the camera to the sea floor, and the number of images and annotations in the train and test splits.

The image annotations are based on a subset of ten morphological classes of the fauna identification guide presented in [28] (see Fig. 4). The images were annotated in BIIGLE 2.0 [5] using MAIA [14] with an additional review using the Lawnmower tool to annotate OOI that were missed by MAIA. In total, the datasets contain 10,784 manual annotations on 2,200 images. Compared to datasets of other research areas such as the detection of everyday objects, the datasets presented here may seem rather small. However, the acquisition of annotations in marine images is much more costly, as it requires more training and background knowledge in marine biology. This makes it infeasible to generate datasets as large as e.g. COCO [16] to evaluate machine learning methods in this research area.

The datasets S083, S155, S171 and S233 have been made available with this publication [34]-[37]. Example images with annotations can be found in the supplementary material.

B. EVALUATION METRIC
A common metric to evaluate the performance of an object detection method is the mean average precision [44]. In this context, object detections are produced based on the segmentation output of Mask R-CNN as described in [14] (see Fig. 3). This allows only the calculation of the "recall" (i.e. the percentage of OOI that were detected) and the "precision" (i.e. the percentage of correct detections in the final result) but does not allow a ranking of the detections, so the mean average precision is not applicable. Another metric, which is the harmonic mean of the recall and precision, is the F_1-Score [45].
FIGURE 3. Example for inference with Mask R-CNN and the final object detection on a subsection of image TIMER_2015_09_04at09_13_31IMG_0864.jpg of the S155 dataset. The image (a) is processed by Mask R-CNN which returns a segmentation mask for "interesting" pixels (b). The regions of interesting pixels are converted to circle annotations for the final detections (c). The full image with manual annotations can be found in the supplementary material.
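The conversion from a segmentation mask to circle annotations, as shown in Fig. 3, can be sketched as follows. This illustrates the principle only; the use of scipy.ndimage for connected-component labeling and the choice of the radius as the distance from the region centroid to its farthest pixel are assumptions, not necessarily the exact procedure of MAIA or UnKnoT.

```python
# Sketch: convert a binary "interesting pixels" mask into circle annotations
# (center x, center y, radius), one circle per connected region.
import numpy as np
from scipy import ndimage


def mask_to_circles(mask):
    labeled, num_regions = ndimage.label(mask)
    circles = []
    for region_id in range(1, num_regions + 1):
        ys, xs = np.nonzero(labeled == region_id)
        cx, cy = xs.mean(), ys.mean()
        # Radius: distance from the centroid to the farthest pixel of the region.
        radius = np.sqrt((xs - cx) ** 2 + (ys - cy) ** 2).max()
        circles.append((cx, cy, radius))
    return circles
```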
A variant of the F_1-Score is the F_2-Score, which puts a higher weight on the recall and which has been used in a similar context to evaluate the object detection performance of the MAIA method [14]:

F_2(recall, precision) = \frac{5 \cdot precision \cdot recall}{(4 \cdot precision) + recall} \qquad (5)

In case of an object detection method for images in marine environmental monitoring and exploration, a minimum of 80% for the recall and 10% for the precision can be considered acceptable [14]. The F_2-Score does not take this into account. For example, it is possible to achieve a higher F_2-Score based on a precision of 20% and a recall of 70% than an F_2-Score based on a precision of 10% and a recall of 80%. In this context, the latter result would be more desirable and should yield a higher score in the evaluation.

As a consequence, we do not apply the F_2-Score as the primary metric to evaluate UnKnoT. Instead, we propose the "Logistic Score" (L-Score) as a new metric which is better suited to evaluate marine object detection with regard to a minimum recall of 80% and a minimum precision of 10%. The L-Score is the harmonic mean of the two logistic functions L_r to assess the recall and L_p to assess the precision (see (6), (7) and (8)). L_r is centered on the value of 80% recall with a growth rate that yields L_r(1) ≈ 1 (see Fig. 5a) and L_p is centered on the value of 10% precision with a growth rate that yields L_p(0) ≈ 0 (see Fig. 5b). The L-Score produces high scores for a recall close to or greater than 80% and a precision close to or greater than 10%, and low scores otherwise (see Fig. 5c).

L_r(recall) = \frac{1}{1 + e^{-0.25 \cdot (100 \cdot recall - 80)}} \qquad (6)

L_p(precision) = \frac{1}{1 + e^{-0.5 \cdot (100 \cdot precision - 10)}} \qquad (7)

L(recall, precision) = \frac{2 \cdot L_r(recall) \cdot L_p(precision)}{L_r(recall) + L_p(precision)} \qquad (8)
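A direct implementation of both metrics makes this difference concrete. The following sketch reproduces the example above; recall and precision are given as fractions in [0, 1].

```python
# Sketch of the F2-Score (Eq. 5) and the proposed L-Score (Eqs. 6-8).
import math


def f2_score(recall, precision):
    return 5 * precision * recall / (4 * precision + recall)


def l_score(recall, precision):
    l_r = 1 / (1 + math.exp(-0.25 * (100 * recall - 80)))      # Eq. (6)
    l_p = 1 / (1 + math.exp(-0.5 * (100 * precision - 10)))    # Eq. (7)
    return 2 * l_r * l_p / (l_r + l_p)                          # Eq. (8)


# The case from the text: the F2-Score prefers 70% recall at 20% precision ...
print(f2_score(0.7, 0.2), f2_score(0.8, 0.1))   # approx. 0.47 vs. 0.33
# ... while the L-Score prefers 80% recall at 10% precision.
print(l_score(0.7, 0.2), l_score(0.8, 0.1))     # approx. 0.14 vs. 0.50
```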
C. EXPERIMENTS
To evaluate the UnKnoT method, each of the four datasets was separated into train and test splits. The test splits consist of the images I_test that contain ≈ 10% of the annotations of the dataset (A_test). The train splits consist of the remaining images (I_train) and annotations (A_train) of the respective dataset (see Table 1). For evaluation, UnKnoT was applied to a given train split as source dataset D^s and a given test split as target dataset D^t.

All combinations of using two datasets as D^s and D^t were evaluated in experiments using the following methods for comparison: UnKnoT (E_{sc,tr,au}^{D^s→D^t}), UnKnoT without enhanced training compared to MAIA (E_{sc,au}^{D^s→D^t}), UnKnoT without enhanced image augmentation (E_{sc,tr}^{D^s→D^t}), UnKnoT only with scale transfer (E_{sc}^{D^s→D^t}) and the baseline configuration of the MAIA object detection stage without any knowledge transfer (E^{D^s→D^t}). The subscripts "sc", "tr" and "au" refer to scale transfer, enhanced training and enhanced image augmentation, respectively. For all experiments except E^{D^s→D^t}, the combinations D^s = D^t were not evaluated as this would mean knowledge transfer within the same dataset. Each experiment was repeated three times and the average L-Score was calculated as the final performance. We denote the average resulting L-Scores of the experiments E_{sc,tr,au}^{D^s→D^t} and E^{D^s→D^t} as L_{sc,tr,au}^{D^s→D^t} and L^{D^s→D^t}, respectively.

IV. RESULTS
The effect of scale transfer that is applied with UnKnoT can be seen in Fig. 6. In case of source dataset S083, scale transfer magnifies the OOI with a factor of d_i^{s→t} > 1 (see Fig. 6, first row). In contrast to that, the size of the OOI of the source datasets S155 and S171 is reduced with a factor of d_i^{s→t} < 1 during scale transfer, with the exception of A^{S155→S171} where the size is marginally increased (see Fig. 6, second and third row). In case of S233 as source dataset, the size of the OOI is increased with S155 and S171 as target datasets and decreased with S083 as target dataset (see Fig. 6, fourth row).

Table 2 shows the average resulting L-Scores of the experiments E^{D^s→D^t} without knowledge transfer and E_{sc,tr,au}^{D^s→D^t} with knowledge transfer. The experiments E^{D^s→D^t} without knowledge transfer show the highest scores for the cases D^s = D^t, where the images of source and target come from the same dataset. The experiments E^{S155→S171} and E^{S171→S155} show almost identical scores to the experiments with D^s = D^t of these datasets. The experiments E^{D^s→S083} with D^s ≠ S083 as well as E^{S083→S155} and E^{S083→S171} show a score close to 0.
FIGURE 5. The harmonic mean of the two logistic functions L_r (a) and L_p (b) forms the L-Score (c).

FIGURE 4. Examples for the ten classes of OOI (rows) of each of the four image datasets (columns) that were used to evaluate UnKnoT. Scales of OOI can vary drastically between different datasets.

FIGURE 6. Annotation patches of the "sea cucumber" class without scale transfer (dashed outline on the main diagonal) compared to scale-transferred annotation patches of A^{s→t}. The rows denote the source dataset and the columns denote the target dataset (e.g. the patch in the first row and second column is from A^{S083→S155}). Annotation patches produced with a scale transfer factor of d^{s→t} > 1 are marked with a + and patches produced with a scale transfer factor of d^{s→t} < 1 are marked with a −.

Eight of the twelve experiments E_{sc,tr,au}^{D^s→D^t} with knowledge transfer show higher scores than the experiments E^{D^s→D^t} with the same combination of datasets. The L-Scores are increased by an average of 0.32. However, further inspection of the output of Mask R-CNN reveals invalid segmentation results for the experiments E_{sc,tr,au}^{S083→S155} and E_{sc,tr,au}^{S083→S171}. In these experiments, the segmentations produced by Mask R-CNN show only crude region proposal boxes instead of the refined regions of a valid segmentation (see Fig. 7). Similarly, the segmentation results for the experiment E_{sc,tr,au}^{S083→S233} are not as refined as desired. The score of E_{sc,tr,au}^{S233→S155} is decreased when compared to object detection without knowledge transfer, whereas the score of E_{sc,tr,au}^{S233→S171} shows one of the highest valid increases.
TABLE 2. Average resulting L-Score of the experiments without knowledge transfer (L^{D^s→D^t}), with knowledge transfer (L_{sc,tr,au}^{D^s→D^t}) and the average increase of the L-Score through knowledge transfer. Experiments based on a scale transfer factor of d_i^{s→t} < 0.9 are highlighted.

TABLE 3. Standard deviation of the area of the circle annotations per class and dataset, and the average over all datasets, given as multiples of the average annotation area of the respective class. Rows with an average standard deviation of less than 1.5 are highlighted.

TABLE 5. Average resulting L-Score, recall and precision of all experiments with a scale transfer factor of d_i^{s→t} < 0.9.

When the detection is limited to the subset of OOI classes that have an average intra-class area standard deviation of less than 1.5 times their average annotation area ("Coral", "Crustacean", "Ipnops fish" and "Ophiuroid", see Table 3), the L-Scores of both experiments converge to 0.58 ± 0.10 (S233 → S155) and 0.86 ± 0.03 (S233 → S171) but are still not equal. All these experiments are exclusively the cases where scale transfer was applied with a factor of d_i^{s→t} > 1. Among the remaining experiments only E_{sc,tr,au}^{S171→S155} shows a slightly decreased L-Score compared to object detection without knowledge transfer. In this experiment, a scale transfer factor of 0.9 < d_i^{s→t} < 1 was applied. The average increase of the L-Scores of the remaining experiments, where a scale transfer factor of d_i^{s→t} < 0.9 was applied, is 0.58.

The detailed results of all experiments including L-Score, recall and precision are presented in Tables 4 and 5.

V. DISCUSSION
The UnKnoT method applies knowledge transfer from a source dataset D^s with existing annotations to a target dataset D^t for object detection. The knowledge transfer consists of scale transfer, which adapts the scales of OOI in the source dataset D^s to the scales of OOI in the target dataset D^t, and of enhanced data augmentation for typical images of the sea floor.
Fig. 6 shows that the scale transfer effectively transforms the scale of OOI of the source dataset to the scale of OOI of the target dataset. First we will review the results obtained for experiments with a scale transfer factor of d_i^{s→t} > 1 (see patches marked with + in Fig. 6). In this scenario, the images of the source dataset have been transformed by upscaling, as the OOI of the target dataset are shown larger and more detailed. In a real setting, the images of the target dataset would have been captured by an AUV or OFOS closer to the sea floor than in the previous dives. In case of S083 as source dataset, the OOI are transformed to a scale that matches the scale of the OOI in the target dataset. However, the scaling blurs the OOI and they do not appear as sharply in focus as the OOI in the target datasets. The results are similar but not as pronounced in case of S233 as target dataset. In the opposite scenario, where the images of the target dataset would have been captured further away from the sea floor than in the previous dives, the images of the source dataset are transformed by downscaling with a scale transfer factor of d_i^{s→t} < 1 (see patches marked with − in Fig. 6). In case of S083 as the target dataset, the scale of the transformed OOI matches the scale of the OOI of the target dataset and the OOI appear in focus. Considering only the visual appearance of the OOI, UnKnoT works more effectively if the source dataset was captured closer to the sea floor than the target dataset and the scale of the annotated OOI is reduced during knowledge transfer. This observation is confirmed by the experimental results.

The experiments E^{D^s→D^t} with D^s = D^t, where the images of source and target come from the same dataset, show the highest average L-Scores. This is to be expected, as Mask R-CNN is trained with OOI that appear most similar to the OOI that should be detected. These experiments can be seen as a baseline with the best possible object detection performance in this context. Notably, the experiments E^{S171→S155} and E^{S155→S171} show a score almost equal to E^{S155→S155} and E^{S171→S171}, respectively. Although these datasets differ in the distribution of annotations in the images (cf. |I_test| and |A_test| in Table 1), both datasets were captured at a similar distance to the sea floor with an OFOS. When Mask R-CNN is trained on one dataset and applied to the other, no knowledge transfer is required to achieve a very good object detection performance. Other notable results are given by the experiments E^{D^s→S083} with D^s ≠ S083, as well as E^{S083→S155} and E^{S083→S171}, which show a score close to 0. Such a low L-Score is produced if either the recall is bad (i.e. < 80%), the precision is bad (i.e. < 10%) or both. In case of the three experiments E^{D^s→S083} with D^s ≠ S083, the low average recall of 47% is the cause for the low L-Score (see Table 4). Trained with OOI at a much larger scale, Mask R-CNN is unable to achieve an adequate recall in these cases. For the experiments E^{S083→S155} and E^{S083→S171}, the low average precision of 5% causes the low L-Score (see Table 4). Again, the high difference in the scale of OOI is the cause for the bad object detection performance.

The experiments E_{sc,tr,au}^{D^s→D^t} can be separated into the same two scenarios as the annotation patches of Fig. 6 mentioned above, where the scale transfer is only effective in the cases where a scale transfer factor of d_i^{s→t} < 1 is applied.

The experiments where the source dataset D^s has a higher average distance to the sea floor than the target dataset D^t belong to the first scenario. Even though UnKnoT produces an improved object detection performance in some of these cases, the segmentation results of Mask R-CNN are invalid (i.e. not as refined as desired) or the object detection performance is highly affected by the intra-class area standard deviation of the annotations. An invalid segmentation (as can be seen in Fig. 7) can be the result of OOI that were highly distorted by a large scale transfer factor d_i^{s→t} >> 1, so the trained Mask R-CNN model cannot produce a meaningful segmentation for the target dataset. Although the datasets S155 and S171 are very similar in terms of the average distance of the camera to the sea floor, they show very different L-Scores in the experiments E_{sc,tr,au}^{S233→S155} and E_{sc,tr,au}^{S233→S171}. A closer look at the intra-class area standard deviations of the annotations reveals that the compositions of annotations of some classes differ between these datasets (see Table 3). A high intra-class area standard deviation can be amplified by scale transfer and can potentially result in unrealistically large OOI in the annotation patches of the source dataset. A limited amount of training samples per class and an equally high intra-class standard deviation in the target dataset can lead to highly different object detection performances, even if the source datasets were captured at a similar average distance to the sea floor. When limited only to classes that show an average intra-class area standard deviation of less than 1.5 times their average annotation area (see Table 3), the L-Scores produced by the experiments converge, but are still not equally high. This confirms the observation that UnKnoT is not well suited for cases where the source dataset was captured at a higher distance to the sea floor than the target dataset.

The experiments where the source dataset D^s has a lower average distance to the sea floor than the target dataset D^t belong to the second scenario. Among these cases only E_{sc,tr,au}^{S171→S155}, in which a scale transfer factor of 0.9 < d_i^{s→t} < 1 was applied, shows a slightly decreased L-Score compared to object detection without knowledge transfer. This highlights a drawback of the proposed L-Score, as only small changes in the precision and/or recall can cause high differences in the L-Score if the score is already high. In case of E_{sc,tr,au}^{S171→S155}, the lower L-Score is produced by a slightly lower precision of 12% compared to 16% of E^{S171→S155}, and an actually slightly higher recall of 90% compared to 88% of E^{S171→S155} (see Table 4). Still, even if UnKnoT does not have a negative impact on the object detection performance in this case, it does not improve the performance either. Hence, we only denote the experiments in which a scale transfer factor of d_i^{s→t} < 0.9 was applied as "relevant".
as ‘‘relevant’’. These are the cases with a sufficiently large on a large scale using observation platforms such as
difference in the average distance of the camera to the sea AUVs. Following Step 1, the preferred target distance
floor. On average, UnKnoT improves the object detection to the sea floor should be 3.4 m. At this distance,
performance by an L-Score of 0.58 (189%) compared to the images cover a larger area than the images of Step 1,
object detection without knowledge transfer in these cases. potentially containing more OOI (cf. |Atrain | in Table 1).
s
Where the experiments E D →S083 with Ds 6 = S083 produced 3) UnKnoT should be used for object detection with the
a bad average recall of 47%, UnKnoT improves the average annotated dataset of Step 1 as source dataset and each
recall to 86% (see Table 4). Notably, the improved object of the datasets acquired in Step 2 as target dataset.
detection performance is highest for S233 → S083 compared 4) MAIA [14] should be used for the final image annota-
to S155 → S083 and S171 → S083. Also, the object tion of each of the datasets acquired in Step 2, by using
detection performance is improved to a similarly high level the object detection results of Step 3 as training pro-
for S155 → S233 and S171 → S233, in case of Esc,tr,au S171→S233 posals. The object detection results of Step 3 replace
even surpassing the baseline average L-Score of E S233→S233 . the results of the novelty detection stage of MAIA and
These results indicate that UnKnoT produces a better object ensure a highly specialized Mask R-CNN model for
detection performance with a source dataset that was captured each individual dataset in the instance segmentation
at an average distance to the sea floor that is roughly half the stage.
average distance of the target dataset. This image acquisition and annotation scheme can be an
Considering only the relevant experiments, scale transfer efficient way to produce large volumes of high-quality image
accounts for most of the improvements in the object detection annotations in typical scenarios of the field of marine envi-
performance. The additional enhanced training configuration ronmental monitoring and exploration.
of Mask R-CNN and the data augmentation improve the In summary, we presented UnKnoT, a new method for
object detection performance even further (see Table 5). unsupervised knowledge transfer that allows the reuse of
existing knowledge in the form of image annotations for
VI. CONCLUSION object detection in new marine image datasets that show sim-
Based on the observations and experimental results we draw ilar OOI. In addition, we presented the L-Score, a metric that
the following conclusions: If the annotated source dataset and is better suited to evaluate the object detection performance
the target dataset are very similar in terms of average distance in this particular context. We evaluated the effectiveness of
to the sea floor and observation platform, no knowledge trans- UnKnoT with four fully annotated image datasets, compris-
fer is required to achieve a good object detection performance ing a total of 10,784 annotations on 2,200 images captured
with a machine learning model such as Mask R-CNN. If the in the same geographical area at different distances to the
annotated source dataset was captured at roughly half the sea floor. Our experimental results have shown that UnKnoT
distance to the sea floor than the target dataset, UnKnoT greatly improves the object detection performance compared
can be used to greatly improve the object detection perfor- to object detection without knowledge transfer in the relevant
mance in an unsupervised way. As the discrepancy in average cases. Based on these results, we conclude by recommend-
distances to the sea floor increases, the increase in object ing a four-step image acquisition and annotation scheme for
detection performance by UnKnoT decreases, but the final future studies, which can be an efficient way to produce large
object detection is still much better than if no knowledge volumes of high-quality image annotations in the field of
transfer is performed. marine environmental monitoring and exploration.
REFERENCES
[1] K. J. Morris, B. J. Bett, J. M. Durden, V. A. I. Huvenne, R. Milligan, D. O. B. Jones, S. McPhail, K. Robert, D. M. Bailey, and H. A. Ruhl, "A new method for ecological surveying of the abyss using autonomous underwater vehicle photography," Limnol. Oceanography, Methods, vol. 12, no. 11, pp. 795–809, Nov. 2014.
[2] T. Schoening, K. Köser, and J. Greinert, "An acquisition, curation and management workflow for sustainable, terabyte-scale marine image analysis," Sci. Data, vol. 5, no. 1, Dec. 2018, Art. no. 180181.
[3] R. Proctor, T. Langlois, A. Friedman, S. Mancini, X. Hoenner, and B. Davey, "Cloud-based national on-line services to annotate and analyse underwater imagery," in Proc. IMDIS Int. Conf. Mar. Data Inf. Syst., vol. 59, 2018, p. 49.
[4] B. Schlining and N. Stout, "MBARI's video annotation and reference system," in Proc. OCEANS, Sep. 2006, pp. 1–5.
[5] D. Langenkämper, M. Zurowietz, T. Schoening, and T. W. Nattkemper, "BIIGLE 2.0—Browsing and annotating large marine image collections," Frontiers Mar. Sci., vol. 4, p. 83, Mar. 2017.
[6] T. Schoening, J. Osterloff, and T. W. Nattkemper, "RecoMIA—Recommendations for marine image annotation: Lessons learned and future directions," Frontiers Mar. Sci., vol. 3, p. 59, Apr. 2016.
[7] J. Monk, N. S. Barrett, D. Peel, E. Lawrence, N. A. Hill, V. Lucieer, and K. R. Hayes, "An evaluation of the error and uncertainty in epibenthos cover estimates from AUV images collected with an efficient, spatially-balanced design," PLoS ONE, vol. 13, no. 9, Sep. 2018, Art. no. e0203827.
[8] J. M. Durden, B. J. Bett, T. Schoening, K. J. Morris, T. W. Nattkemper, and H. A. Ruhl, "Comparison of image annotation data generated by multiple investigators for benthic ecology," Mar. Ecol. Prog. Ser., vol. 552, pp. 61–70, Jun. 2016.
[9] T. Schoening, T. Kuhn, M. Bergmann, and T. W. Nattkemper, "DELPHI—Fast and adaptive computational laser point detection and visual footprint quantification for arbitrary underwater image collections," Frontiers Mar. Sci., vol. 2, p. 20, Apr. 2015.
[10] O. Beijbom, P. J. Edmunds, D. I. Kline, B. G. Mitchell, and D. Kriegman, "Automated annotation of coral reef survey images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 1170–1177.
[11] X. Li, M. Shang, H. Qin, and L. Chen, "Fast accurate fish detection and recognition of underwater images with fast R-CNN," in Proc. OCEANS-MTS/IEEE Washington, Oct. 2015, pp. 1–5.
[12] T. Schoening, M. Bergmann, J. Ontrup, J. Taylor, J. Dannheim, J. Gutt, A. Purser, and T. W. Nattkemper, "Semi-automated image analysis for the assessment of megafaunal densities at the arctic deep-sea observatory HAUSGARTEN," PLoS ONE, vol. 7, no. 6, Jun. 2012, Art. no. e38179.
[13] M. Moniruzzaman, S. M. S. Islam, M. Bennamoun, and P. Lavery, "Deep learning on underwater marine object detection: A survey," in Proc. Int. Conf. Adv. Concepts Intell. Vis. Syst. Antwerp, Belgium: Springer, 2017, pp. 150–160.
[14] M. Zurowietz, D. Langenkämper, B. Hosking, H. A. Ruhl, and T. W. Nattkemper, "MAIA—A machine learning assisted image annotation method for environmental monitoring and exploration," PLoS ONE, vol. 13, no. 11, Nov. 2018, Art. no. e0207498.
[15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[16] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. ECCV. Zurich, Switzerland: Springer, 2014, pp. 740–755.
[17] P. Tang, X. Wang, S. Bai, W. Shen, X. Bai, W. Liu, and A. Yuille, "PCL: Proposal cluster learning for weakly supervised object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 1, pp. 176–191, Jan. 2020.
[18] D. Zhang, J. Han, L. Yang, and D. Xu, "SPFTN: A joint learning framework for localizing and segmenting objects in weakly labeled videos," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 475–489, Feb. 2020.
[19] D. Zhang, J. Han, G. Guo, and L. Zhao, "Learning object detectors with semi-annotated weak labels," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 12, pp. 3622–3635, Dec. 2019.
[20] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[21] E. C. Orenstein and O. Beijbom, "Transfer learning and deep feature extraction for planktonic image data sets," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2017, pp. 1082–1088.
[22] J. Greinert, T. Schoening, K. Köser, and M. Rothenbeck, Seafloor Images and Raw Context Data Along AUV Track SO242/1_83-1_AUV10 (Abyss_196) During SONNE Cruise SO242/1. Bremerhaven, Germany: PANGAEA, 2017, doi: 10.1594/PANGAEA.881896.
[23] A. Purser et al., Seabed Photographs Taken Along OFOS Profile SO242/2_155-1 During SONNE Cruise SO242/2. Bremerhaven, Germany: PANGAEA, 2018, doi: 10.1594/PANGAEA.890617.
[24] A. Purser et al., Seabed Photographs Taken Along OFOS Profile SO242/2_171-1 During SONNE Cruise SO242/2. Bremerhaven, Germany: PANGAEA, 2018, doi: 10.1594/PANGAEA.890620.
[25] A. Purser et al., Seabed Photographs Taken Along OFOS Profile SO242/2_233-1 During SONNE Cruise SO242/2. Bremerhaven, Germany: PANGAEA, 2018, doi: 10.1594/PANGAEA.890633.
[26] A. Tsymbal, "The problem of concept drift: Definitions and related work," Comput. Sci. Dept., Trinity College Dublin, vol. 106, no. 2, p. 58, 2004.
[27] D. Langenkämper, R. van Kevelaer, A. Purser, and T. W. Nattkemper, "Gear-induced concept drift in marine images and its effect on deep learning classification," Frontiers Mar. Sci., vol. 7, p. 506, Jul. 2020.
[28] T. Schoening, A. Purser, D. Langenkämper, I. Suck, J. Taylor, D. Cuvelier, L. Lins, E. Simon-Lledó, Y. Marcon, D. O. B. Jones, T. Nattkemper, K. Köser, M. Zurowietz, J. Greinert, and J. Gomes-Pereira, "Megafauna community assessment of polymetallic-nodule fields with cameras: Platform and methodology comparison," Biogeosciences, vol. 17, no. 12, pp. 3115–3133, Jun. 2020. [Online]. Available: https://www.biogeosciences.net/17/3115/2020/
[29] M. Skaldebø, A. S. Muntadas, and I. Schjolberg, "Transfer learning in underwater operations," in Proc. OCEANS-Marseille, Jun. 2019, pp. 1–8.
[30] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2223–2232.
[31] J. Walker, T. Yamada, A. Prugel-Bennett, and B. Thornton, "The effect of physics-based corrections and data augmentation on transfer learning for segmentation of benthic imagery," in Proc. IEEE Underwater Technol. (UT), Apr. 2019, pp. 1–8.
[32] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proc. ECCV, Sep. 2018, pp. 801–818.
[33] T. Yamada, A. P. Bennett, and B. Thornton, "Learning features from georeferenced seafloor imagery with location guided autoencoders," J. Field Robot., pp. 1–16, May 28, 2020, doi: 10.1002/rob.21961.
[34] M. Zurowietz, S083, Jan. 2020, doi: 10.5281/zenodo.3600132.
[35] M. Zurowietz, S155, Jan. 2020, doi: 10.5281/zenodo.3603803.
[36] M. Zurowietz, S171, Jan. 2020, doi: 10.5281/zenodo.3603809.
[37] M. Zurowietz, S233, Jan. 2020, doi: 10.5281/zenodo.3603815.
[38] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 2961–2969.
[39] J. Cupitt and K. Martinez, "VIPS: An image processing system for large images," Proc. SPIE, vol. 2663, pp. 19–28, Feb. 1996.
[40] S. C. Wong, A. Gatt, V. Stamatescu, and M. D. McDonnell, "Understanding data augmentation for classification: When to warp?" in Proc. Int. Conf. Digit. Image Comput., Techn. Appl. (DICTA), Nov. 2016, pp. 1–6.
[41] L. Perez and J. Wang, "The effectiveness of data augmentation in image classification using deep learning," 2017, arXiv:1712.04621. [Online]. Available: http://arxiv.org/abs/1712.04621
[42] W. Abdulla. (2017). Mask R-CNN for Object Detection and Instance Segmentation on Keras and TensorFlow. [Online]. Available: https://github.com/matterport/Mask_RCNN
[43] E. J. Foell, H. Thiel, and G. Schriever, "DISCOL: A long-term, large-scale, disturbance-recolonization experiment in the abyssal eastern tropical South Pacific Ocean," in Proc. Offshore Technol. Conf., 1990. [Online]. Available: https://doi.org/10.4043/6328-MS
[44] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010.
[45] M. Sokolova and G. Lapalme, "A systematic analysis of performance measures for classification tasks," Inf. Process. Manage., vol. 45, no. 4, pp. 427–437, Jul. 2009.

MARTIN ZUROWIETZ received the B.Sc. degree in bioinformatics and the M.Sc. degree in informatics in the natural sciences from Bielefeld University, Bielefeld, Germany, in 2013 and 2016, respectively, where he is currently pursuing the Ph.D. degree with the Biodata Mining Group. His research interests include automatic object detection in marine imagery using deep learning methods, assistance systems for manual marine image annotation, and the development of large-scale web-based collaborative image annotation platforms.

TIM W. NATTKEMPER is currently a Professor of biodata mining with the Faculty of Technology, Bielefeld University, Germany. His research interests include the development of methods for the analysis of digital images and video (bioimaging, medical imaging, marine imaging, remote sensing). One particular focus of his research is the development of algorithmic approaches to harvest large marine image and sensor data collections for hidden regularities. Two very important aspects are the computational classification/quantification with machine learning and computer vision, and the integration of field expert knowledge through modern web-platforms and data-driven visualizations.