
A Deep-Learning Method Using Auto-encoder and Generative Adversarial Network for Anomaly Detection on Ancient Stone Stele Surfaces

Yikun Liu (1), Yuning Wang (2), Cheng Liu (1)

1: School of Cultural Heritage, Northwest University, Xi'an, China
2: FLOW, Engineering Mechanics, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden

arXiv:2308.04426v1 [cs.CV] 8 Aug 2023
Abstract

Accurate and timely detection of natural deterioration and man-made damage on the surfaces of ancient stone steles is essential for their preventive conservation. Existing approaches to cultural heritage monitoring struggle to achieve this goal because of the difficulty of balancing accuracy, efficiency, timeliness, and cost. This paper presents a deep-learning method, built on an auto-encoder (AE) and a generative adversarial network (GAN), that automatically detects such anomalies on ancient stone steles in real time. The proposed method overcomes the limitations of existing methods by requiring no extensive anomaly samples while enabling comprehensive detection of unpredictable anomalies. The method comprises the stages of monitoring, data acquisition, pre-processing, model structuring, and post-processing. Taking the stone steles of the Longmen Grottoes as a case study, an unsupervised learning model based on AE and GAN architectures is proposed and validated with a reconstruction accuracy of 99.74%. In the evaluation, the method proficiently detected all seven artificially designed anomalies and did so without false alarms, demonstrating its precision and reliability. This research provides novel ideas and possibilities for the application of deep learning in the field of cultural heritage.

Keywords: Deep Learning, Anomaly Detection, Cultural Heritage Conservation, Generative Adversarial Network, Stone Stele.

1 Introduction

Ancient stone steles, which are found at almost every major heritage site in China, represent a significant part of the cultural heritage. However, they are prone to various kinds of environmental degradation and man-made damage, and urgently need effective preventive conservation.

Timely detection of these risks is considered the prerequisite of preventive conservation. Commonly employed strategies for detecting risks to these outdoor relics include regular investigation by professional conservators and daily patrols by heritage managers or safety officers. Nevertheless, both methods have limitations: the former requires a wealth of specialized resources and does not provide immediate feedback, while the latter, despite its regularity, often overlooks slowly evolving deterioration due to a lack of specialization.

Over recent years, deep learning techniques have been increasingly applied in the field of cultural heritage [1]. Although certain studies have ventured into automated deterioration recognition, these solutions are not sufficiently adapted to meet the needs inherent in preventive conservation.
Current research primarily employs supervised learning based on deep neural networks (DNNs) to achieve classification [2], [3], [4], object detection [5], [6], and semantic segmentation [7], [8] of various forms of deterioration and damage. However, these methods rely on a large number of high-quality labeled samples, a requirement that greatly limits their application. In the cultural heritage domain, samples of deterioration and damage are extremely difficult to obtain in sufficient quantity for supervised deep learning. This is further complicated by the diversity of deterioration and damage, which escalates the total number of samples required. For example, patterns of man-made damage such as doodling and carving are visually unpredictable, making it impossible to ensure that they are learned by the model. This directly hinders the use of such deep learning methods for discovering emerging risks. In summary, novel approaches are needed to realize the potential of deep-learning techniques in the preventive conservation of cultural heritage.

2 Research aim

This research aims to develop a deep-learning method based on an auto-encoder (AE) and a generative adversarial network (GAN) for real-time automatic detection of deterioration and damage (hereinafter collectively referred to as 'anomalies') appearing on the surfaces of ancient stone steles.

The main technical features the method aims to implement are as follows: it eliminates the extensive requirement for anomaly samples during modeling and remains sensitive to unpredictable anomalies during detection.

To this end, an HD camera is first installed to monitor the surface of the stone stele and to collect normal samples as training data. Then an AE/GAN-based neural network is built to reconstruct the image features of these normal samples. Subsequently, post-processing methods are designed to detect and locate anomalies by comparing the reconstruction differences between abnormal and normal samples. Finally, the effectiveness of the method is verified by a series of simulated test images containing various types of anomalies and ambient lighting conditions.

3 Study area

The Longmen Grottoes, situated on both sides of the Yi River south of Luoyang in China, serve as the study area for this research, which focuses on an open-air stone stele at the Fengxian Temple. In addition to tens of thousands of Buddha statues and more than 60 stupas, there are about 2,800 steles with inscriptions at the heritage site [9].

Figure 1: Location of the study area: the Fengxian Temple at the Longmen Grottoes, Luoyang, China, and the monitored stone stele with inscriptions.

The region where the grottoes are located has a temperate continental climate with cold, arid winters, hot, rainy summers, and windy spring and fall seasons, resulting in the natural deterioration of this open-air stele over a long period of time [10], [11]. Signs of surface weathering are prevalent on the stele, manifested as lateral surface dissolution and extensive black deposit accumulation. Furthermore, the presence of a top-down crack indicates a potential structural safety risk that warrants continued attention. Additionally, this stele, due to its accessibility to large numbers of tourists, is
highly susceptible to accidental human-induced damage from proximate interaction.

4 Proposed methods

4.1 Overview of the method

In response to the need for preventive conservation of cultural heritage, the proposed method is intended to automatically track the emergence of both natural deterioration and man-made damage on the stone stele surface. Specifically, we expect these anomalies to be accurately located and rapidly reported to the conservators at the first sign of occurrence, so that timely intervention can be made before more serious damage occurs.

The scheme of our proposed method is as follows (Fig. 2). Initially, a fixed camera serves as the monitoring instrument, capturing images of the stele surface in its normal state. These images form the training dataset for a model we developed based on auto-encoder (AE) and generative adversarial network (GAN) frameworks. The trained model is capable of reconstructing images of the stele surface in its normal state. When an image of the stele surface containing anomalies is later fed to the model, the reconstructed image and the input image differ significantly where the anomalies are located, since such anomalies were never learned by the model. Finally, by choosing appropriate methods to measure this difference, it can be determined whether the stele is in an emergency situation, and the location of the anomaly is segmented and presented as a binary image.

4.2 Data acquisition

The lower central area of the stele was chosen as the monitoring area, considering that most of the natural deterioration and man-made damage risks are concentrated there.

A high-definition camera fixed to the side of the stele is used as the monitoring device. The camera was programmed to capture the surface of the stele regularly from an identical angle, thereby ensuring the continuity and comparability of the collected images. Each image has a resolution of 3840 × 2748 pixels, with vertical and horizontal pixel densities of 96 dpi, and is recorded in 24-bit, three-channel RGB color. This ensures that the majority of macro-scale surface changes can be accurately documented. The camera's shooting frequency is set at two images per day, one in the morning and one in the afternoon. This is a compromise between maximizing data diversity to train the model and minimizing the storage footprint when collecting high-resolution images over long periods of time.

4.3 Pre-processing

Pre-processing of the collected images is an integral step to enhance the quality of our training data.

Firstly, noisy data such as severely overexposed images and images with foreign-object occlusion are removed by manual inspection. This contributes to improving model accuracy. It has to be mentioned, however, that if such noise reoccurs in the future, it could potentially trigger false alarms.

Subsequently, the original photographs are cropped and partitioned into six equal-sized regions. Each of these regions is resized to a 640x480 resolution, maintaining the three RGB color channels and thus preserving the color information vital for anomaly detection. The input size of the proposed model is therefore fixed at 640x480x3. For each distinct region, an independent model is trained; when deployed, these six models operate in parallel. This procedure is taken to prevent potential negative impacts, such as training difficulties and overfitting, that might be brought about by an overly high input resolution. For illustrative purposes, this study only showcases one selected area.
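For concreteness, the following is a minimal sketch of the partitioning step described above, assuming OpenCV (cv2); the 2 × 3 grid layout and the function name are illustrative assumptions rather than details given by the authors.

```python
import cv2

def partition_stele_image(path, rows=2, cols=3, size=(640, 480)):
    """Split one 3840x2748 monitoring photograph into six equal regions and
    resize each region to the fixed model input size (640x480, three channels).
    The 2x3 grid layout is an assumption; the paper does not state the layout."""
    img = cv2.imread(path)                      # BGR image, shape (2748, 3840, 3)
    h, w = img.shape[:2]
    tile_h, tile_w = h // rows, w // cols
    regions = []
    for r in range(rows):
        for c in range(cols):
            tile = img[r * tile_h:(r + 1) * tile_h, c * tile_w:(c + 1) * tile_w]
            regions.append(cv2.resize(tile, size))  # one independent model per region
    return regions
```

Each returned region would then be routed to its own per-region model, with the six models running in parallel at deployment time.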
Figure 2: Flowchart depicting the comprehensive process for anomaly detection on ancient stone steles: monitoring with a fixed HD camera, data acquisition, cleaning and augmentation (pre-processing), the AE/GAN network (encoder, decoder, discriminator and latent-space representation), and post-processing (image registration, color matching, similarity measurement and image binarization), leading to a SAFE or WARNING decision. An artificially designed anomaly image containing a bird dropping is used as an example to show how the proposed method works.

Finally, we implement data augmentation by adjusting the images' white balance and exposure. This adjustment is conceived to simulate the variation of photographic parameters under different lighting conditions. For each image, the exposure value and the white balance are altered randomly, once each, within ranges of +/- 1 EV and +/- 1000 K, respectively. As a result, each original image yields two augmented copies, enlarging the training data accordingly. Notably, given the fixed angle and focal length from which all photographs were taken, common data augmentation techniques such as rotation, flipping, and zooming were deliberately not used in our approach.

4.4 Proposed neural network

We propose a modified, unsupervised-learning model based on GANomaly [12], a branch of the generative adversarial network (GAN) family specifically designed for anomaly detection. The architecture, as shown in Fig. 3, comprises a generator (G), an encoder (E) and a discriminator (D).
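As a rough illustration of this three-sub-network layout, the PyTorch sketch below mirrors the encoder–decoder–encoder generator plus discriminator structure of GANomaly [12] described here; the layer sizes, latent dimensionality and module names are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # downsampling convolutional block (assumed layer sizes)
    return nn.Sequential(nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2, inplace=True))

def deconv_block(c_in, c_out):
    # upsampling block built from a transposed convolution
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class Encoder(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        self.features = nn.Sequential(conv_block(3, 32), conv_block(32, 64),
                                      conv_block(64, 128))
        self.to_latent = nn.Conv2d(128, latent_dim, 1)  # latent representation z (kept spatial here)
    def forward(self, x):
        return self.to_latent(self.features(x))

class Decoder(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(latent_dim, 128, 1), deconv_block(128, 64),
                                 deconv_block(64, 32), deconv_block(32, 3))
    def forward(self, z):
        return torch.tanh(self.net(z))  # reconstructed image x_hat

class GanomalyLikeModel(nn.Module):
    """Generator G = G_E + G_D, second encoder E, and discriminator D (sketch)."""
    def __init__(self, latent_dim=100):
        super().__init__()
        self.G_E = Encoder(latent_dim)   # generator encoder
        self.G_D = Decoder(latent_dim)   # generator decoder
        self.E = Encoder(latent_dim)     # second encoder: same shape, separate weights
        self.D = nn.Sequential(conv_block(3, 32), conv_block(32, 64),
                               nn.Conv2d(64, 1, 4))  # DCGAN-style patch discriminator
    def forward(self, x):
        z = self.G_E(x)        # latent code of the input
        x_hat = self.G_D(z)    # reconstruction of the input
        z_hat = self.E(x_hat)  # latent code of the reconstruction
        return x_hat, z, z_hat
```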

The generator G, following the bow-tie architecture of the autoencoder [13], is designed to learn a latent-space representation z. The input data from the reference space, x ∈ X, is reduced by the encoder G_E, and a decoder G_D is employed to project the representation back to the reference space. The second sub-network is the discriminator D, which is used to classify the input x and the reconstructed image x̂; it is the standard discriminator network introduced in DCGAN [14]. The third sub-network is the encoder E, which compresses the reconstructed image x̂. It has the same architecture as the generator encoder G_E while holding different parameters. Note that the encoder E is not involved in the training process; E compresses x̂ to derive its latent feature vector ẑ, which has the same dimension as the feature vector z, and the l2 norm error between z and ẑ is taken as part of the loss function. This sub-network is the distinctive part of GANomaly [12], used to stabilise the latent space and improve the reconstruction quality. For the encoders and decoders in the architecture, we adopt convolutional neural networks (CNNs) to learn hierarchical representations from the original image domain, aiming to extract the anomaly features from abnormal images.

Figure 3: Schematic pipeline of the proposed GANomaly architecture. Colored cubes denote the different types of layers used to compose each sub-network. The generator is annotated with a black rectangle. Dashed lines denote features obtained by the corresponding sub-networks, while arrows point to the corresponding loss functions.

The input and output of the model are images of the normal stele, x ∈ R^{w×h×3}, where w and h denote the width and height of the image, respectively. The input is first propagated to the encoder G_E, where the data is compressed to a vector z ∈ R^d by three blocks comprising CNN layers (for learning hierarchical representations) and linear layers. z is known as the latent-space feature of the input data and is hypothesised to be the lowest-dimensional representation of x, obtained via z = G_E(x). Subsequently, the latent representation z is fed to the generator decoder G_D, which uses transposed CNN layers to project the representation back to the reference space. The function of the decoder is to upscale the vector z and reconstruct the image x as x̂ ∈ R^{w×h×3} via x̂ = G_D(z). We use a contextual loss to update the trainable parameters in the generator:

L_con = E_{x∼p_X} ∥x − G(x)∥_1        (1)

The reconstructed image, the so-called fake output in a GAN model, is fed to the discriminator as supervision to penalise the generator. In other words, the generator is trained to 'fool' the discriminator, so we introduce an adversarial loss to penalise the model during training:

L_adv = E_{x∼p_X} ∥f(x) − E_{x∼p_X} f(G(x))∥_2        (2)

Additionally, we employ an extra encoder loss L_enc to minimize the distance between the latent feature vector of the input (z = G_E(x)) and the encoded features of the generated image (ẑ = E(x̂)). Through this optimization, the generator learns how to encode the features of
the generated image for normal samples. L_enc is defined as:

L_enc = E_{x∼p_X} ∥G_E(x) − E(G(x))∥_2        (3)

Overall, our total loss function for the generator becomes:

L = ω_adv L_adv + ω_con L_con + ω_enc L_enc        (4)

where ω_adv, ω_con and ω_enc are the weights of the corresponding loss terms, which adjust the impact of the individual losses on the overall loss function. In the present study, we adopt ω_adv = 1, ω_con = 40 and ω_enc = 1, respectively.
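Putting equations (1)–(4) together, a minimal sketch of the generator objective could look as follows (PyTorch assumed); here f stands for the discriminator's intermediate feature extractor used in the adversarial term, and the batch-mean reduction and squared-error form are implementation assumptions rather than the authors' exact code.

```python
import torch

def generator_loss(x, x_hat, z, z_hat, f, w_adv=1.0, w_con=40.0, w_enc=1.0):
    """Total generator loss L = w_adv*L_adv + w_con*L_con + w_enc*L_enc (Eq. 4).

    x, x_hat : input image and its reconstruction G(x)
    z, z_hat : latent codes G_E(x) and E(G(x))
    f        : callable returning discriminator features, used for L_adv (Eq. 2)
    """
    l_con = torch.mean(torch.abs(x - x_hat))        # contextual loss, Eq. (1), l1 distance
    l_adv = torch.mean((f(x) - f(x_hat)) ** 2)      # adversarial feature-matching loss, Eq. (2)
    l_enc = torch.mean((z - z_hat) ** 2)            # encoder loss, Eq. (3)
    return w_adv * l_adv + w_con * l_con + w_enc * l_enc
```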

4.5 Post-processing

After the completion of model training and deployment, the raw output cannot be used immediately for anomaly detection. Post-processing is designed to quantify the difference between the input and reconstructed images, and subsequently to facilitate binary classification in the form of a segmentation into normal and abnormal. The post-processing steps in this study are: image registration, color matching, similarity measurement, and binarization.

4.5.1 Image registration

The reconstructed image is not perfectly aligned with the input image. The complex texture of the stele surface means any misalignment can introduce substantial noise into the subsequent similarity measurements, thereby undermining the accuracy of anomaly detection. To address this issue, we incorporate image registration, a technique that uses image processing algorithms to spatially align multiple images. Specifically, we adjust the texture of the reconstructed image with reference to each input image, ensuring that the two images align more precisely and are better prepared for subsequent processing stages.

To this end, we employ the Speeded-Up Robust Features (SURF) algorithm, a type of feature-based registration. SURF operates by autonomously identifying key feature points and then accomplishing registration by matching them. This method is particularly appropriate for our study as it is well suited to cases where there are discrepancies in brightness and contrast, meeting the need for robustness in this study. The image registration in this study can be simplified as:

X̂′ = SURF(X, X̂)        (5)

where X and X̂ represent the input image and the reconstructed image, respectively, X̂′ is the reconstructed image after registration, and SURF is the function representing the SURF algorithm.

4.5.2 Color matching

Color differences between the input and reconstructed images can also confound the accuracy of the subsequent similarity measurements. We therefore implement color matching to make the reconstructed image as closely aligned as possible with the input image in terms of exposure and color temperature, without altering the texture of the image.

Specifically, we begin by normalizing the pixel values of each of the three channels in the reconstructed image, subtracting the mean and dividing by the standard deviation. Following normalization, we reference the input image to rescale and recenter these values: the normalized values are multiplied by the standard deviation of the input image, and the mean of the input image is then added. As a result, the distribution of pixel values in the reconstructed image comes into alignment with that of the corresponding channel in the input image. This process is defined as:

X̂′′_{ijk} = (X̂′_{ijk} − µ_{X̂_k}) / σ_{X̂_k} · σ_{X_k} + µ_{X_k}        (6)

where X̂′′_{ijk} is the pixel value in a specific channel of the reconstructed image after color matching, i, j are the pixel coordinates, and k is the channel index.
σ_{X_k}, σ_{X̂_k}, µ_{X_k} and µ_{X̂_k} are the standard deviations and means, respectively, of all pixels in a specific channel of the input and reconstructed images.
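As a rough illustration of the two alignment steps just described (Sections 4.5.1 and 4.5.2), the sketch below uses OpenCV keypoint matching with a homography warp followed by channel-wise moment matching. ORB features are used here as a freely available stand-in for SURF (which requires the opencv-contrib build), and the matcher, RANSAC threshold and match count are assumptions for illustration, not the authors' settings.

```python
import cv2
import numpy as np

def register_to_input(x, x_hat, max_matches=200):
    """Warp the reconstruction x_hat onto the input x (cf. Eq. 5).
    ORB keypoints are used in place of SURF; both are feature-based registrations."""
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(cv2.cvtColor(x, cv2.COLOR_BGR2GRAY), None)
    k2, d2 = orb.detectAndCompute(cv2.cvtColor(x_hat, cv2.COLOR_BGR2GRAY), None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d2, d1), key=lambda m: m.distance)[:max_matches]
    src = np.float32([k2[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k1[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return cv2.warpPerspective(x_hat, H, (x.shape[1], x.shape[0]))

def match_color(x, x_hat_reg):
    """Channel-wise moment matching of the registered reconstruction (Eq. 6)."""
    x = x.astype(np.float32)
    y = x_hat_reg.astype(np.float32)
    for k in range(3):
        y[..., k] = (y[..., k] - y[..., k].mean()) / (y[..., k].std() + 1e-6) \
                    * x[..., k].std() + x[..., k].mean()
    return np.clip(y, 0, 255).astype(np.uint8)
```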
4.5.3 Similarity measurement

Given the need to localize anomalies, a similarity matrix, rather than a single similarity value, is employed for similarity measurement. Calculating the reconstruction error of each pixel directly by matrix subtraction (MS) is the most straightforward and efficient method. Specifically, we first subtract the reconstructed image pixel by pixel from the input image and then decenter the result by the median of each RGB channel. Finally, we linearly stack the three channels to obtain the similarity matrix. The decentering is performed to keep the reconstruction error small in regions without anomalies, which in turn allows a better distinction between normal and abnormal. The operation for each pixel is defined as:

MS(X, X̂′′)_{ij} = Σ_k ( X_{ijk} − X̂′′_{ijk} − med_k(X_{ijk} − X̂′′_{ijk}) )        (7)

where MS represents the matrix subtraction method and med_k stands for the median over the k channels.

However, after a series of experimental tests, we observed that this method exhibited insufficient sensitivity to differences in image structure, resulting in compromised detection results. This can be attributed to the fact that the computation occurs at the individual-pixel level, without taking into account the correlation between pixels.

To ensure comprehensive coverage of various anomalies in the similarity measurements, we complement the matrix subtraction method with the structural similarity index (SSIM) [15]. The SSIM method is designed to mimic human perception by focusing on brightness, contrast, and structural similarities between images. In our study, we first convert the input and reconstructed images to grayscale by linearly stacking the three color channels, to comply with the SSIM requirements for input. Then we compare patches, rather than individual pixels, of the input and reconstructed images with SSIM. In the resulting similarity matrix, each pixel corresponds to the similarity value of the patch pair that takes that pixel as a vertex. The SSIM for each patch is defined as:

SSIM(X, X̂′′) = (2 µ_{X_g} µ_{X̂′′_g} + C_1) / (µ_{X_g}² + µ_{X̂′′_g}² + C_1) · (2 σ_{X_g X̂′′_g} + C_2) / (σ_{X_g}² + σ_{X̂′′_g}² + C_2)        (8)

where X_g and X̂′′_g are the grayscale versions of X and X̂′′. The corresponding means, standard deviations and mutual covariance are denoted by µ, σ and σ_{X_g X̂′′_g}, respectively. C_1 and C_2 are the regularization constants of the brightness and contrast terms, respectively.

4.5.4 Binarization

Using an appropriate detection threshold, the similarity value of each pixel within the matrix obtained in the prior section is categorized into one of two classes, normal or abnormal. This process yields a binary image, which provides an outline of the anomalies present in the input image in the form of an image segmentation, enabling the detection and localization of the anomalies. It must be noted, however, that the choice of detection threshold significantly influences the quality of detection. Setting the threshold too low could label an excessive number of pixels as anomalies, potentially causing false alarms due to noise, whereas setting it too high could lead to the opposite outcome.

Taking these factors into account, this study empirically selects a moderate detection threshold and further applies a denoising process to the binary image to mitigate potential noise.
The determination of the optimal detection threshold is not discussed in this paper. Considering that the most notable difference between noise and genuine anomalies lies in the area size, the denoising process is implemented as follows: an area value is determined, and pixel regions in the binarized image smaller than this value are considered noise and removed.

The binarization and noise-reduction processes are applied to both of the similarity matrices obtained from the matrix subtraction (MS) and structural similarity index (SSIM) calculations, generating two binarized images. By taking the pixel-by-pixel union of these two images, the resulting image serves as the final anomaly detection output.
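To make the full post-processing chain concrete, here is a minimal sketch combining the MS map (Eq. 7), an SSIM map (Eq. 8, computed with scikit-image), thresholding, area-based denoising and the pixel-wise union. The threshold values, the minimum-area value, the use of absolute differences in the MS map and the scikit-image/OpenCV helpers are illustrative assumptions, not the authors' exact settings.

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity

def anomaly_mask(x, x_hat, t_ms=40.0, t_ssim=0.55, min_area=50):
    """Combine MS- and SSIM-based similarity maps into one binary anomaly mask."""
    diff = x.astype(np.float32) - x_hat.astype(np.float32)
    # Matrix subtraction (Eq. 7): decenter each channel by its median, stack channels.
    diff -= np.median(diff, axis=(0, 1), keepdims=True)
    ms_map = np.abs(diff).sum(axis=2)  # absolute values used here for a one-sided threshold

    # SSIM map (Eq. 8) on grayscale versions; full=True returns the per-pixel map.
    gx = cv2.cvtColor(x, cv2.COLOR_BGR2GRAY)
    gy = cv2.cvtColor(x_hat, cv2.COLOR_BGR2GRAY)
    _, ssim_map = structural_similarity(gx, gy, full=True, data_range=255)

    # Thresholding: a large MS error or a low SSIM value marks a pixel as abnormal.
    mask_ms = (ms_map > t_ms).astype(np.uint8)
    mask_ssim = (ssim_map < t_ssim).astype(np.uint8)

    def drop_small(mask):
        # Area-based denoising: remove connected components smaller than min_area pixels.
        n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
        keep = np.zeros_like(mask)
        for i in range(1, n):
            if stats[i, cv2.CC_STAT_AREA] >= min_area:
                keep[labels == i] = 1
        return keep

    # The pixel-wise union of the two denoised masks is the final detection output.
    return np.maximum(drop_small(mask_ms), drop_small(mask_ssim))
```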
ing the 9 spare images from the original dataset.
5 Experimental Results These images were artificially edited to simu-
late a range of potential future deterioration and
damages patterns that might occur on the stele
5.1 Training details surface. By artificially creating these anomalies,
In the present study, we acquire high-definition we could ensure representativeness in our test-
normal images of stele surface collected over a ing conditions. Moreover, this approach made it
period of six-month as initial data. After elimi- easier to compare results within the same surface
nating instances when the camera malfunctioned texture, providing a more controlled evaluation
or when foreign objects obstructed the view, a environment.
total of 283 usable images remained. Thanks The artificially introduced anomalies covered
to image augmentations, the scale of the data is 7 categories: carving, crack, moss, doodle, salt,
enhanced to 849. We use 840 images as training water stains and bird dropping. They are de-
dataset while the rest 9 were utilized to create signed considering the diversity both in bright-
artificially anomalous conditions, serving as test ness, color, and structural patterns. From each
images for method assessment. of the nine normal images, seven artificially
The training details are summarized in Tab .1. anomalous images were created, resulting in a
Once we have trained a model for a region, total of 63 anomaly images for method evalua-
we first evaluate the reconstruction accuracy by tion. Thus, coupled with the original 9 normal
computing the relative l2 norm error as: images, the evaluation dataset contains a total
of 72 images.
||y − ỹ||2
Erec = × 100%, (9) The test results of the 8 images are shown in
||y||2 Fig. 4, including the raw output, the interme-
where y denotes the ground truth and ỹ de- diate results of the post-processing steps, and
notes the prediction, respectively. In the Tab .1, the comparison of the final detection results with
we also report the reconstruction accuracy aver- Ground Truth.
aged over all 6 regions, which is denoted in red. Our evaluation reveals that in the region out-
Our proposed model achieves 99.74% accuracy side the anomalies, the raw outputs are strik-
on whole test dataset, indicating the promising ingly similar to the input images, indicating the
reconstruction performance of the models, which model’s impressive reconstruction capabilities.
paves the way for further post-processing. The result of the reconstruction of the anoma-
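A one-line helper for the relative l2 error of Eq. (9), assuming NumPy arrays for the ground-truth image y and the reconstruction ỹ:

```python
import numpy as np

def relative_l2_error(y_true, y_pred):
    """Relative l2 norm error E_rec of Eq. (9), returned as a percentage."""
    return np.linalg.norm(y_pred - y_true) / np.linalg.norm(y_true) * 100.0
```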

Optimizer    Weight Decay    Learning Rate
Adam         1 × 10^-7       1 × 10^-3
Batch Size   No. Epochs      E_rec (%)
16           1200            0.26

Table 1: Summary of the training setup for each model and the achieved reconstruction error averaged over all segments.

5.2 Method evaluation

In addition to reconstruction accuracy, it is also crucial to evaluate how effectively the proposed method detects anomalies. To this end, an evaluation plan was developed using the 9 spare images from the original dataset. These images were artificially edited to simulate a range of potential future deterioration and damage patterns that might occur on the stele surface. By artificially creating these anomalies, we could ensure representativeness in our testing conditions. Moreover, this approach made it easier to compare results within the same surface texture, providing a more controlled evaluation environment.

The artificially introduced anomalies covered 7 categories: carving, crack, moss, doodle, salt, water stains and bird droppings. They were designed considering diversity in brightness, color, and structural pattern. From each of the nine normal images, seven artificially anomalous images were created, resulting in a total of 63 anomaly images for method evaluation. Thus, coupled with the original 9 normal images, the evaluation dataset contains a total of 72 images.

The test results of 8 of these images are shown in Fig. 4, including the raw output, the intermediate results of the post-processing steps, and the comparison of the final detection results with the ground truth.

Our evaluation reveals that in the regions outside the anomalies, the raw outputs are strikingly similar to the input images, indicating the model's impressive reconstruction capabilities.
Figure 4: Experimental results obtained from testing normal and abnormal images. Rows (i)–(viii) correspond to the normal, doodle, moss cover, carving, crack, salt, bird dropping, and water stain cases; columns (a)–(j) show the input image, the reconstructed image, the image after registration and color matching, the MS and SSIM similarity measurements (d)(e), the MS and SSIM binarization and denoising results (f)(g), the final detection result (h), and the ground truth (j).

The reconstruction of the anomalous regions is consistent with the original texture of the stele surface. In particular, comparing (i.b) with (i.a) shows that the normal image was almost completely reconstructed.

Although the high reconstruction accuracy makes the enhancements offered by image registration and color matching (c) virtually indiscernible to the unaided eye, it is essential not to discount these steps. The subtle improvements they bring may play a crucial role in guaranteeing the accuracy of the similarity measure. To quantify their effect on the model's reconstruction results and better evaluate their contribution, we aim to conduct a more detailed examination in future studies.
The similarity measurement results are graphically represented as heatmaps, where darker hues denote larger differences (d)(e). Clearly, the areas with anomalies show significantly larger reconstruction errors than the rest of the regions, allowing a well-defined outlining of these anomalies. However, an unavoidable element of noise is present in the results generated by both matrix subtraction (MS) and the structural similarity index (SSIM). Since the values of these noise regions are close to those of the anomalies, they can negatively affect the binarization process by blurring the distinction between anomaly-induced differences and noise.

The binarization results (f)(g) stem from the empirically selected thresholds for matrix subtraction (MS) and the structural similarity index (SSIM), including the respective detection and noise-reduction thresholds. This paper does not delve into the selection of these thresholds. From the results, it is evident that MS and SSIM exhibit variable proficiency in detecting different types of anomalies. Both salt (vi) and bird droppings (vii) are proficiently detected by both methods, largely due to their distinctive brightness, color, and structural differences compared with the stele surface. On the other hand, the differences between carvings (iv) and cracks (v) and the stele surface are mainly structural, making SSIM the more effective detection method for these types of anomalies. Conversely, in the case of the doodle (ii), moss cover (iii), and water stains (viii), SSIM's performance falls short due to the high structural similarity of these anomalies with the stele surface, which makes them nearly undetectable, while MS maintains good detection performance in these cases.

Comparing the final detection results (h) with the ground truth (j), it is observed that after combining the detection results of MS and SSIM, our method successfully detects all 7 types of anomalies. There are slight differences in the detection results for moss cover (iii) and cracks (v) when compared with the ground truth, with part of the anomaly region not detected. However, this discrepancy does not significantly impede the effectiveness of the detection. Notably, both methods excel at detecting the normal case (i), yielding no false alarms and demonstrating their precision and reliability.

6 Conclusion

The present study introduces a deep-learning method for the real-time automatic detection of natural deterioration and human damage on ancient stone steles, using the Longmen Grottoes as an example. Utilizing a model architecture based on the auto-encoder (AE) and the generative adversarial network (GAN), the proposed method eliminates the requirement for extensive anomaly samples while maintaining sensitivity to unpredictable anomalies, thereby addressing the limitations of existing deep learning methods for heritage deterioration recognition.

The proposed model achieves a reconstruction accuracy of 99.74% with a small architecture and dataset, which indicates its promising performance. Regarding post-processing, the similarity measurement strategy combining matrix subtraction (MS) and the structural similarity index (SSIM) comprehensively covers the differences between the input and reconstructed images in terms of brightness, color, and texture structure. By choosing appropriate thresholds, the binarization and denoising processes accomplish the two-class distinction between normal and abnormal well, and the results can be used directly as the detection output.

In the final method evaluation, all seven types of artificially designed anomalies were successfully detected without false alarms on normal conditions, demonstrating the method's precision and reliability. Some minor discrepancies in detection, such as the partial non-detection of certain anomalies like moss cover and cracks, pinpoint areas for future refinement. The proposed method provides novel scenarios and ideas for the application of deep learning in the field of preventive conservation, by demonstrating a
tool for risk detection that is superior in both efficiency and capability. Future research may further explore the selection of optimal thresholds and continue to fine-tune the model for higher accuracy and a wider range of application scenarios.

References

[1] Mayank Mishra. Machine learning techniques for structural health monitoring of heritage buildings: A state-of-the-art review and case studies. Journal of Cultural Heritage, 47:227–245, 2021.

[2] Mehmet Ergün Hatir, Mücahit Barstuğan, and İsmail İnce. Deep learning-based weathering type recognition in historical stone monuments. 45:193–203, 2020.

[3] Safia Meklati, Kenza Boussora, Mohamed El Hafedh Abdi, and Sid-Ahmed Berrani. Surface damage identification for heritage site protection: A mobile crowd-sensing solution based on deep learning. ACM Journal on Computing and Cultural Heritage, 16(2):1–24, 2023.

[4] Jianfang Cao, Hongyan Cui, Qi Zhang, and Zibang Zhang. Ancient mural classification method based on improved AlexNet network. Studies in Conservation, 65(7):411–423, 2020.

[5] Mayank Mishra, Tanmoy Barman, and G. V. Ramana. Artificial intelligence-based visual inspection system for structural health monitoring of cultural heritage. 2022.

[6] Zheng Zou, Xuefeng Zhao, Peng Zhao, Fei Qi, and Niannian Wang. CNN-based statistics and location estimation of missing components in routine inspection of historic buildings. Journal of Cultural Heritage, 38:221–230, 2019.

[7] Ergün Hatır, Mustafa Korkanç, Andreas Schachner, and İsmail İnce. The deep learning method applied to the detection and mapping of stone deterioration in open-air sanctuaries of the Hittite period in Anatolia. 51:37–49.

[8] Ziwen Liu, Rosie Brigham, Emily Rosemary Long, Lyn Wilson, Adam Frost, Scott Allan Orr, and Josep Grau-Bové. Semantic segmentation and photogrammetry of crowdsourced images to monitor historic facades. 10(1):27, 2022.

[9] UNESCO. Longmen Grottoes – UNESCO World Heritage List. Online, 30 November 2000.

[10] Tao Xu and Wu Xiu Ding. Research on the weathering problems of Longmen Grottoes. Advanced Materials Research, 446:1537–1540, 2012.

[11] Yun Fang, Jun-jian Zhang, Guo-zheng Xia, Wei-qiang Zhou, and Mei-liang Su. Application of infrared thermal imaging on seepage probing of Fengxian Temple in Longmen Grottoes. Geoscience, 27(3):750, 2013.

[12] Samet Akcay, Amir Atapour-Abarghouei, and Toby P. Breckon. GANomaly: Semi-supervised anomaly detection via adversarial training, 2018.

[13] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[14] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks, 2015.

[15] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.