A Deep-Learning Method Using Auto-Encoder and Gene
Abstract
Accurate detection of natural deterioration and man-made damage on the surfaces of ancient steles at the earliest stage is essential for their preventive conservation. Existing methods for cultural heritage preservation cannot fully achieve this goal because of the difficulty of balancing accuracy, efficiency, timeliness, and cost. This paper presents a deep-learning method that automatically detects the above-mentioned emergencies on ancient stone steles in real time, employing an autoencoder (AE) and a generative adversarial network (GAN). The proposed method overcomes the limitations of existing methods by requiring no extensive anomaly samples while enabling comprehensive detection of unpredictable anomalies. The method includes stages of monitoring, data acquisition, pre-processing, model structuring, and post-processing. Taking the Longmen Grottoes' stone steles as a case study, an unsupervised learning model based on AE and GAN architectures is proposed and validated with a reconstruction accuracy of 99.74%. The evaluation shows that the method detects all seven artificially designed anomaly types and raises no false alarms on normal images. This research provides novel ideas and possibilities for the application of deep learning in the field of cultural heritage.
Keywords: Deep Learning, Anomaly Detection, Cultural Heritage Conservation, Generative Adversarial Network, Stone Stele.
servation. Current research primarily employs supervised learning based on deep neural networks (DNNs) to achieve classification [2], [3], [4], object detection [5], [6], and semantic segmentation [7], [8] of various forms of deterioration and damage. However, these methods rely heavily on a large number of high-quality labeled samples, a requirement that greatly limits their application. In the cultural heritage domain, samples of deterioration and damage are very difficult to obtain in quantities sufficient for supervised deep learning. This is further complicated by the diversity of deterioration and damage, which increases the total number of samples required. For example, patterns of man-made damage such as doodles and carvings are visually unpredictable, making it impossible to ensure that they are learned by the model. This directly hinders the possibility of
2 Research aim
The research aims to develop a deep-learning method based on auto-encoder (AE) and gen-
designed to detect and locate anomalies by comparing the reconstruction difference between abnormal and normal samples. Finally, the effectiveness of the method was verified by a series of simulated test images containing various types of anomalies and ambient lighting conditions.
3 Study area
The Longmen Grottoes, situated on both sides of the Yi River south of Luoyang in China, serve as the study area for this research, which focuses on an open-air stone stele at the Fengxian Temple. In addition to tens of thousands of Buddha statues and more than 60 stupas, the site holds about 2,800 steles with inscriptions [9].
[Figure: Fengxian Temple, Longmen Grottoes; labels: inscriptions, stone stele]
highly susceptible to accidental human-induced damage from the proximate interaction.
4 Proposed methods
4.1 Overview of the method
In response to the need for preventive conservation of cultural heritage, the proposed method is intended to automatically track the emergence of both natural deterioration and man-made damage on the stone stele surface. Specifically, these anomalies should be accurately located and rapidly reported to the conservators at the first sign of occurrence, so that timely intervention can be made before more serious damage occurs.
The scheme of the proposed method is as follows (Fig. 2). Initially, a fixed camera serves as the monitoring instrument, capturing images of the stele surface in its normal state. These images then form the training dataset for a model we developed based on the auto-encoder (AE) and generative adversarial network (GAN) frameworks. The trained model is capable of reconstructing images of the stele surface in its normal state. When an image of the stele surface containing anomalies is later fed to the model, the anomalies have never been learned by the model, so the reconstructed image and the input image differ significantly where the anomalies are located. Finally, by choosing appropriate methods to measure this difference, it can be determined whether the stele is in an emergency situation, and the location of the anomaly is segmented and presented as a binary image.
4.2 Data acquisition
The lower center area of the stele was chosen as the monitoring area, considering that most of the natural deterioration and man-made damage risks are concentrated there.
A high-definition camera fixed to the side of the stele is used as the monitoring device. The camera was programmed to capture the surface of the stele regularly from an identical angle, thereby ensuring the continuity and comparability of the collected images. Each image has a resolution of 3840 × 2748 pixels, with vertical and horizontal pixel densities of 96 dpi, and is recorded in 24-bit RGB three-channel color. This ensures that the majority of macro-scale surface changes can be accurately documented. The camera's shooting frequency is set at two images per day, one in the morning and one in the afternoon. This is a compromise between maximizing data diversity for training the model and minimizing the storage footprint when collecting high-resolution images over long periods of time.
4.3 Pre-processing
Pre-processing of the collected images is an integral step to enhance the quality of the training data.
Firstly, noisy data such as severely overexposed images and images occluded by foreign objects are removed by manual inspection. This contributes to improving model accuracy. However, it should be noted that when such noise reoccurs in the future, it could potentially trigger false alarms.
Subsequently, the original photographs are cropped and partitioned into six equal-sized regions. Each region is resized to a resolution of 640 × 480, retaining the RGB three-channel color and thus preserving the color information vital for anomaly detection. The input size of the proposed model is therefore fixed at 640 × 480 × 3. For each region, an independent model is trained. When deployed, these six models operate in parallel. These steps are taken to prevent potential negative impacts, such as model training difficulties and overfitting, that might be caused by overly high resolution. For illustrative purposes, this study showcases only one selected area.
Finally, we implement data augmentation by adjusting the images' white balance and expo-
[Figure 2 panels: Monitoring (HD camera, stele surface); Data acquisition, cleaning, and augmentation; AE/GAN network (encoder, latent space, decoder, discriminator, encoder E); image registration and color matching; normal/abnormal output (SAFE / WARNING)]
Figure 2: Flowchart depicting the complete process for anomaly detection on ancient stone steles. An artificially designed anomaly image with a bird dropping is used as an example to show how the proposed method works.
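The tiling step of Section 4.3 (partitioning each photograph into six equal-sized regions and resizing each region to 640 × 480) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 3 × 2 grid, the use of the uncropped 3840 × 2748 frame, the nearest-neighbour resize, and the helper names `tile_boxes`, `crop` and `resize_nearest` are all hypothetical, because the text does not specify the crop window, the grid layout or the interpolation method. Images are represented as nested lists of RGB tuples.

```python
def tile_boxes(width, height, cols=3, rows=2):
    """Return (left, top, right, bottom) boxes splitting a frame into rows*cols equal tiles.

    The 3 x 2 default grid is an assumption; the paper only states "six equal-sized regions".
    """
    if width % cols or height % rows:
        raise ValueError("frame size must be divisible by the grid for equal-sized tiles")
    tile_w, tile_h = width // cols, height // rows
    return [(c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            for r in range(rows) for c in range(cols)]


def crop(image, box):
    """Cut one tile out of an image stored as rows of (r, g, b) tuples."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def resize_nearest(image, out_w, out_h):
    """Nearest-neighbour resize (an assumed interpolation method)."""
    in_h, in_w = len(image), len(image[0])
    return [[image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


# Tile geometry for the camera resolution reported in Section 4.2.
boxes = tile_boxes(3840, 2748)

# Toy 6 x 4 image so the example runs instantly; a real tile would be
# resized with resize_nearest(tile, 640, 480).
toy = [[(x, y, 0) for x in range(6)] for y in range(4)]
tiles = [crop(toy, box) for box in tile_boxes(6, 4)]
upscaled = resize_nearest(tiles[0], 4, 4)
```

Under these assumed settings each tile measures 1280 × 1374 pixels before resizing. Since that is not a 4:3 shape, the actual crop window presumably differs from the full frame; a production pipeline would more likely rely on an image library for cropping and interpolation.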
The generator $G$, following the bow-tie architecture of the autoencoder [13], is designed to learn the latent-space representation $z$. The input data from the reference space, $x \in X$, is reduced by the encoder $G_E$, and a decoder $G_D$ projects the representation back to the reference space. The second sub-network is the discriminator $D$, which is used to classify the input $x$ and the reconstructed image $\hat{x}$. This sub-network is the standard discriminator network introduced in DCGAN [14]. The third sub-network is the encoder network $E$, which compresses the reconstructed image $\hat{x}$. It has the same architecture as the generator encoder $G_E$ but holds different parameters. Note that the encoder $E$ is not taken into the training process. $E$ compresses $\hat{x}$ to derive its latent-space feature vector $\hat{z}$, which has the same dimension as the feature vector $z$. We take the $\ell_2$-norm error between $z$ and $\hat{z}$ as a part of the loss function. This sub-network is the distinctive part of GANomaly [12] and is used to stabilise the latent space and improve the reconstruction quality. For the encoders and decoders in the architecture, we adopt convolutional neural networks (CNNs) to learn hierarchical representations from the original image domain, aiming to extract the anomaly features from the abnormal image.
Figure 3: Schematic pipeline of the proposed GANomaly architecture. Colored cubes denote the different types of layers used to compose each sub-network. The generator is outlined by a black rectangle. Dashed lines denote features obtained from the sub-networks, while arrows point to the corresponding loss functions.
The input and output of the model are images of the normal stele, $x \in \mathbb{R}^{w \times h \times 3}$, where $w$ and $h$ denote the width and height of the image, respectively. The input is first propagated to the encoder $G_E$, where the data is compressed to a vector $z \in \mathbb{R}^{d}$ by three blocks comprising CNN layers, which learn hierarchical representations, and linear layers. $z$ is known as the latent-space feature of the input data and is hypothesised to be the lowest-dimensional representation of $x$, obtained via $z = G_E(x)$. Subsequently, the latent-space representation $z$ is fed to the decoder of the generator, $G_D$, which uses transposed CNN layers to project the representation back to the reference space. The function of the decoder is to upscale the vector $z$ to reconstruct the image $x$ as $\hat{x} \in \mathbb{R}^{w \times h \times 3}$ via $\hat{x} = G_D(z)$. We employ a contextual loss to update the trainable parameters in the generator:
$$L_{con} = \mathbb{E}_{x \sim p_X} \lVert x - G(x) \rVert_1 \qquad (1)$$
The reconstructed image, the so-called fake output in the GAN model, is fed to the discriminator as supervision to penalise the generator. In other words, the generator is trained to "fool" the discriminator; we therefore introduce an adversarial loss to penalise the model during training:
$$L_{adv} = \mathbb{E}_{x \sim p_X} \lVert f(x) - \mathbb{E}_{x \sim p_X} f(G(x)) \rVert_2 \qquad (2)$$
Additionally, we employ an extra encoder loss $L_{enc}$ to minimize the distance between the latent feature vector of the input ($z = G_E(x)$) and the encoded features of the generated image ($\hat{z} = E(\hat{x})$). Through this optimization, the generator learns how to encode features of the generated image for normal samples. $L_{enc}$ is defined as:
$$L_{enc} = \mathbb{E}_{x \sim p_X} \lVert G_E(x) - E(G(x)) \rVert_2 \qquad (3)$$
Overall, the total loss function for the generator becomes:
$$L = \omega_{adv} L_{adv} + \omega_{con} L_{con} + \omega_{enc} L_{enc} \qquad (4)$$
where $\omega_{adv}$, $\omega_{con}$ and $\omega_{enc}$ are the weights of the corresponding loss terms, which adjust the impact of the individual losses on the overall loss function. In the present study, we adopt $\omega_{adv} = 1$, $\omega_{con} = 40$ and $\omega_{enc} = 1$.
feature-based registration, for this task. SURF operates by automatically identifying key feature points and then accomplishing registration by matching them. This method is particularly appropriate for our study because it is well suited to cases where there are discrepancies in brightness and contrast, meeting this study's need for robustness. The image registration in this study can be simplified as:
$$\hat{X}' = \mathrm{SURF}(X, \hat{X}) \qquad (5)$$
where $X$ and $\hat{X}$ represent the input image and the reconstructed image, respectively, $\hat{X}'$ is the reconstructed image after registration, and SURF is the function representing the SURF algorithm.
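To make the weighting of the generator objective in Eq. (4) concrete, the sketch below evaluates the three loss terms for a single sample using plain Python lists. It is a simplified illustration rather than the training code: the expectations over $p_X$ are collapsed to one sample, all inputs are assumed to be already flattened vectors, and the function and argument names are hypothetical. Only the weights $\omega_{adv} = 1$, $\omega_{con} = 40$ and $\omega_{enc} = 1$ come from the text.

```python
import math


def l1_distance(u, v):
    """Sum of absolute differences, as used by the contextual loss in Eq. (1)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def l2_distance(u, v):
    """Euclidean distance, as used by the adversarial and encoder losses in Eqs. (2)-(3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def generator_loss(x, x_hat, f_x, f_x_hat, z, z_hat,
                   w_adv=1.0, w_con=40.0, w_enc=1.0):
    """Single-sample version of Eq. (4).

    x, x_hat     : flattened input and reconstructed images
    f_x, f_x_hat : discriminator features f(x) and f(G(x)) (assumed precomputed)
    z, z_hat     : latent vectors G_E(x) and E(G(x))
    """
    l_con = l1_distance(x, x_hat)
    l_adv = l2_distance(f_x, f_x_hat)
    l_enc = l2_distance(z, z_hat)
    return w_adv * l_adv + w_con * l_con + w_enc * l_enc


# Toy values chosen only to make the arithmetic easy to follow.
total = generator_loss(
    x=[0.0, 1.0], x_hat=[0.0, 0.5],      # L_con = 0.5
    f_x=[3.0, 4.0], f_x_hat=[0.0, 0.0],  # L_adv = 5.0
    z=[1.0, 1.0], z_hat=[1.0, 1.0],      # L_enc = 0.0
)
```

With these toy values the contextual term contributes 40 × 0.5 = 20 of the total, which illustrates how the weight of 40 makes pixel-level reconstruction fidelity dominate the objective.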
index. $\mu_{X_k}$, $\mu_{\hat{X}_k}$, $\sigma_{X_k}$ and $\sigma_{\hat{X}_k}$ are the average and standard deviation, respectively, of all pixels in a specific channel.
4.5.3 Similarity measurement
Given the necessity to localize anomalies, a similarity matrix rather than a single similarity value is employed for similarity measurement. Calculating the reconstruction error of each pixel directly by matrix subtraction (MS) is the most straightforward and efficient method. Specifically, we first subtract the reconstructed image pixel by pixel from the input image and then decenter the result by the median of each RGB channel. Finally we linearly stack the three channels to ob-
by linearly stacking the three color channels to comply with the SSIM requirements for input. Then we compare the difference between patches, rather than individual pixels, of the input and reconstructed images with SSIM. In the resulting similarity matrix, each pixel corresponds to the similarity value of the patch pair that takes that pixel as a vertex. The SSIM for each patch is defined as:
$$\mathrm{SSIM}(X, \hat{X}'') = \frac{2\mu_{X_g}\mu_{\hat{X}''_g} + C_1}{\mu_{X_g}^2 + \mu_{\hat{X}''_g}^2 + C_1} \cdot \frac{2\sigma_{X_g \hat{X}''_g} + C_2}{\sigma_{X_g}^2 + \sigma_{\hat{X}''_g}^2 + C_2}, \qquad (8)$$
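The post-processing chain of this section (similarity measurement by MS and SSIM, followed by binarization, area-based denoising, and a pixel-wise union of the two binary images) can be sketched in plain Python as below. This is a hedged illustration, not the authors' code. Several details are assumptions because the text does not fix them: the channels are stacked by summing absolute decentered differences, regions use 4-connectivity, variances are population variances, the constants default to the common choice $C_1 = (0.01 \cdot 255)^2$ and $C_2 = (0.03 \cdot 255)^2$ from [15], and the thresholds and the SSIM-branch mask in the demo are made-up values. All function names are hypothetical.

```python
from statistics import mean, median


def ssim_patch(a, b, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """SSIM of two equally sized patches (flat intensity lists), following Eq. (8)."""
    mu_a, mu_b = mean(a), mean(b)
    var_a = mean((p - mu_a) ** 2 for p in a)  # population variance (assumption)
    var_b = mean((q - mu_b) ** 2 for q in b)
    cov = mean((p - mu_a) * (q - mu_b) for p, q in zip(a, b))
    mean_term = (2 * mu_a * mu_b + c1) / (mu_a ** 2 + mu_b ** 2 + c1)
    var_term = (2 * cov + c2) / (var_a + var_b + c2)
    return mean_term * var_term


def ms_map(x, x_hat):
    """Matrix subtraction: per-channel difference, median decentering, channel stacking."""
    h, w = len(x), len(x[0])
    diff = [[[x[i][j][c] - x_hat[i][j][c] for c in range(3)]
             for j in range(w)] for i in range(h)]
    med = [median(diff[i][j][c] for i in range(h) for j in range(w))
           for c in range(3)]
    # Stacking rule (sum of absolute values) is an assumption.
    return [[sum(abs(diff[i][j][c] - med[c]) for c in range(3))
             for j in range(w)] for i in range(h)]


def binarize(matrix, threshold):
    """Mark pixels whose difference exceeds the threshold (for SSIM, pass 1 - SSIM)."""
    return [[1 if v > threshold else 0 for v in row] for row in matrix]


def remove_small_regions(mask, min_area):
    """Clear 4-connected regions that contain fewer than min_area pixels."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    seen = [[False] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                seen[i][j] = True
                stack, region = [(i, j)], []
                while stack:
                    r, c = stack.pop()
                    region.append((r, c))
                    for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                        if 0 <= nr < h and 0 <= nc < w and mask[nr][nc] and not seen[nr][nc]:
                            seen[nr][nc] = True
                            stack.append((nr, nc))
                if len(region) < min_area:
                    for r, c in region:
                        out[r][c] = 0
    return out


def union(mask_a, mask_b):
    """Pixel-wise union of two binary masks."""
    return [[a | b for a, b in zip(ra, rb)] for ra, rb in zip(mask_a, mask_b)]


# Toy 4 x 4 example: a 2 x 2 synthetic anomaly plus one isolated noisy pixel.
base = (100, 100, 100)
x_hat = [[base] * 4 for _ in range(4)]
x = [row[:] for row in x_hat]
for r, c in ((1, 1), (1, 2), (2, 1), (2, 2)):
    x[r][c] = (200, 200, 200)
x[0][3] = (160, 160, 160)

ms_mask = remove_small_regions(binarize(ms_map(x, x_hat), 90), min_area=2)
# Hand-written stand-in for the SSIM branch, used only to demonstrate the union.
ssim_mask = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1]]
detection = union(ms_mask, ssim_mask)
```

In the toy example the isolated pixel passes the threshold but is removed by the area filter, while the 2 × 2 block survives. The union then keeps everything flagged by either branch, which is why the combined result can cover anomalies that only one of the two measures detects.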
old is not discussed in this paper. Considering that the most notable difference between noise and genuine anomalies lies in the area size, the denoising process is implemented as follows: an area value is determined, and pixel regions in the binarized image smaller than this value are considered noise and subsequently removed.
The binarization and noise reduction processes are applied to each of the two similarity matrices, obtained from the matrix subtraction (MS) and structural similarity index (SSIM) calculations, to generate two binarized images. Taking the union of these two images pixel by pixel yields the final anomaly detection output.
5 Experimental Results
5.1 Training details
In the present study, we use high-definition normal images of the stele surface collected over a period of six months as the initial data. After eliminating instances when the camera malfunctioned or when foreign objects obstructed the view, a total of 283 usable images remained. Through image augmentation, the dataset was enlarged to 849 images. We use 840 images as the training dataset, while the remaining 9 were used to create artificially anomalous conditions, serving as test images for method assessment.
The training details are summarized in Tab. 1. Once we have trained a model for a region, we first evaluate the reconstruction accuracy by computing the relative $\ell_2$-norm error as:
$$E_{rec} = \frac{\lVert y - \tilde{y} \rVert_2}{\lVert y \rVert_2} \times 100\%, \qquad (9)$$
where $y$ denotes the ground truth and $\tilde{y}$ denotes the prediction. In Tab. 1, we also report the reconstruction error averaged over all 6 regions, which is denoted in red. With $E_{rec} = 0.26\%$, our proposed model achieves a reconstruction accuracy of 99.74% ($100\% - E_{rec}$) on the whole test dataset, indicating the promising reconstruction performance of the models, which paves the way for further post-processing.
Optimizer    Weight Decay    Learning Rate    Batch Size    No. Epochs    E_rec (%)
Adam         1 × 10^-7       1 × 10^-3        16            1200          0.26
Table 1: Summary of the training setup for each model and the achieved reconstruction error averaged over all segments.
5.2 Method evaluation
In addition to reconstruction accuracy, it is also crucial to evaluate how effectively the proposed method detects anomalies. To this end, an evaluation plan was developed using the 9 spare images from the original dataset. These images were artificially edited to simulate a range of potential future deterioration and damage patterns that might occur on the stele surface. By artificially creating these anomalies, we could ensure representativeness in our testing conditions. Moreover, this approach made it easier to compare results within the same surface texture, providing a more controlled evaluation environment.
The artificially introduced anomalies covered 7 categories: carving, crack, moss, doodle, salt, water stains and bird dropping. They were designed to be diverse in brightness, color, and structural patterns. From each of the nine normal images, seven artificially anomalous images were created, resulting in a total of 63 anomaly images for method evaluation. Thus, together with the original 9 normal images, the evaluation dataset contains a total of 72 images.
The test results of 8 images (one normal and seven anomalous) are shown in Fig. 4, including the raw output, the intermediate results of the post-processing steps, and the comparison of the final detection results with the ground truth.
[Figure 4 rows: (i) Normal, (ii) Doodle, (iii) Moss cover, (iv) Carving, (v) Crack, (vi) Salt, (vii) Bird dropping, (viii) Water stain. Columns: (a) input image, (b) reconstructed image, (c) image registration and color matching, (d) MS and (e) SSIM similarity measurement, (f) MS and (g) SSIM binarization and denoising, (h) detection result, (j) ground truth.]
Figure 4: Experimental results obtained from testing normal and abnormal images. The image types are denoted in the left panel, and the processing steps of the proposed method are denoted at the bottom.
Our evaluation reveals that in the region outside the anomalies, the raw outputs are strikingly similar to the input images, indicating the model's impressive reconstruction capabilities. The result of the reconstruction of the anomalies is consistent with the original texture of the stele surface. In particular, comparing (i.b) with (i.a) shows that the normal image was almost completely reconstructed.
Although the high reconstruction accuracy makes the enhancements offered by image registration and color matching (c) virtually indiscernible to the unaided eye, these steps should not be discounted. The subtle improvements they bring may play a crucial role in guaranteeing the accuracy of the similarity measure. To quantify their effect on the model's reconstruction results and better evaluate their contribution, we aim to conduct a more detailed examination in future studies.
The similarity measurement results are graphically represented as heatmaps, where darker hues denote larger differences (d)(e). Clearly, the areas with anomalies show significantly larger reconstruction errors than the rest of the regions, allowing these anomalies to be clearly outlined. However, an unavoidable element of noise is present in the results generated by both matrix subtraction (MS) and the structural similarity index (SSIM). Since the values of this noise are close to those of the anomalies, it can negatively affect the binarization process by blurring the distinction between anomaly-induced differences and noise.
The binarization results (f)(g) stem from the empirically selected thresholds for MS and SSIM, including the respective detection and noise reduction thresholds. This paper does not discuss the selection of these thresholds. From the results, it is evident that MS and SSIM exhibit variable proficiency in detecting different types of anomalies. Salt (vi) and bird droppings (vii) are detected well by both methods, largely because of their distinctive brightness, color, and structural differences from the stele surface. On the other hand, the differences between carvings (iv) and cracks (v) and the stele surface are mainly structural, making SSIM the more effective detection method for these types of anomalies. Conversely, for doodle (ii), moss cover (iii), and water stains (viii), SSIM falls short because of the high structural similarity of these anomalies to the stele surface, which makes them nearly undetectable, whereas MS maintains good detection performance in these cases.
Comparing the final detection results (h) with the ground truth (j), we observe that, after combining the detection results of MS and SSIM, our method successfully detects all 7 types of anomalies. There are slight differences between the detection results and the ground truth for moss cover (iii) and cracks (v), where part of the anomalous region is not detected. However, this discrepancy does not significantly impede the effectiveness of the detection. Notably, both methods excel in normal (i) detection, yielding no false alarms, which demonstrates their precision and reliability.
6 Conclusion
The present study introduces a deep-learning method for real-time automatic detection of natural deterioration and human damage on ancient stone steles, using the Longmen Grottoes as an example. Utilizing the model architecture of the auto-encoder (AE) and generative adversarial network (GAN), the proposed method eliminates the requirement for extensive anomaly samples while maintaining sensitivity to unpredictable anomalies, which addresses the limitations of existing deep learning methods for heritage deterioration recognition.
The proposed model achieves a reconstruction accuracy of 99.74% with a small architecture and dataset, which indicates its promising performance. Regarding post-processing, the similarity measurement strategy combining matrix subtraction (MS) and the structural similarity index (SSIM) comprehensively covers the differences between the input and reconstructed images in terms of brightness, color, and texture structure. With appropriately chosen thresholds, the binarization and denoising process distinguishes well between the normal and abnormal classes, and its results can be used directly as the detection results.
In the final method evaluation, all seven types of artificially designed anomalies were successfully detected without false alarms under normal conditions, demonstrating the method's precision and reliability. Some minor discrepancies in detection, such as the partial non-detection of certain anomalies like moss cover and cracks, pinpoint areas for future refinement. The proposed method provides novel scenarios and ideas for the application of deep learning in the field of preventive conservation, by demonstrating a
tool for risk detection that is superior in both efficiency and capability. Future research may further explore the selection of optimal thresholds and continue to fine-tune the model for higher accuracy and a wider range of application scenarios.
References
[1] Mayank Mishra. Machine learning techniques for structural health monitoring of heritage buildings: A state-of-the-art review and case studies. Journal of Cultural Heritage, 47:227–245, 2021.
[2] Mehmet Ergün Hatir, Mücahit Barstuğan, and İsmail İnce. Deep learning-based weathering type recognition in historical stone monuments. 45:193–203, 2020.
[3] Safia Meklati, Kenza Boussora, Mohamed El Hafedh Abdi, and Sid-Ahmed Berrani. Surface damage identification for heritage site protection: A mobile crowd-sensing solution based on deep learning. ACM Journal on Computing and Cultural Heritage, 16(2):1–24, 2023.
[4] Jianfang Cao, Hongyan Cui, Qi Zhang, and Zibang Zhang. Ancient mural classification method based on improved AlexNet network. Studies in Conservation, 65(7):411–423, 2020.
[5] Mayank Mishra, Tanmoy Barman, and G. V. Ramana. Artificial intelligence-based visual inspection system for structural health monitoring of cultural heritage. 2022.
[6] Zheng Zou, Xuefeng Zhao, Peng Zhao, Fei Qi, and Niannian Wang. CNN-based statistics and location estimation of missing components in routine inspection of historic buildings. Journal of Cultural Heritage, 38:221–230, 2019.
[7] Ergün Hatır, Mustafa Korkanç, Andreas Schachner, and İsmail İnce. The deep learning method applied to the detection and mapping of stone deterioration in open-air sanctuaries of the Hittite period in Anatolia. 51:37–49.
[8] Ziwen Liu, Rosie Brigham, Emily Rosemary Long, Lyn Wilson, Adam Frost, Scott Allan Orr, and Josep Grau-Bové. Semantic segmentation and photogrammetry of crowdsourced images to monitor historic facades. 10(1):27, 2022.
[9] UNESCO. Longmen Grottoes - UNESCO World Heritage List. Online, November 30 2000.
[10] Tao Xu and Wu Xiu Ding. Research on the weathering problems of Longmen Grottoes. Advanced Materials Research, 446:1537–1540, 2012.
[11] FANG Yun, ZHANG Jun-jian, XIA Guo-zheng, ZHOU Wei-qiang, and SU Mei-liang. Application of infrared thermal imaging on seepage probing of Fengxian Temple in Longmen Grottoes. Geoscience, 27(3):750, 2013.
[12] Samet Akcay, Amir Atapour-Abarghouei, and Toby P. Breckon. GANomaly: Semi-supervised anomaly detection via adversarial training, 2018.
[13] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[14] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks, 2015.
[15] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.