Deep-Learning-Based Lossless Image Coding
(1) a new coding approach based on deep-learning and context-tree modeling for lossless image coding;
(2) a new neural network design for a deep-learning based predictor for lossless image coding;
(3) an efficient context-tree based bit-plane entropy codec;
(4) adaptations of the CALIC context modeling procedure for high resolution images and lenslet images;
(5) a new strategy for generating binary context trees for a bit-plane coding strategy;
(6) an elaborated experimental validation carried out on three different types of data, that is:
(a) UHD photographic images;
(b) lenslet images;
(c) high-resolution video sequences.
The remainder of this paper is organized as follows. Section II outlines state-of-the-art methods in the fields of Machine Learning and Lossless Image Compression. Section III describes the proposed coding approach. The experimental validation and performance analysis of the proposed coding approach are presented in Section IV. Finally, Section V draws the conclusions of this work.

II. STATE-OF-THE-ART

Lossless image compression was highly influenced by the introduction of the Lossless JPEG (JPEG-LS) [6] standard developed by the Joint Photographic Experts Group as an addition to the JPEG standard [22] for lossless and near-lossless compression of continuous-tone images. Although an old standard, JPEG-LS [6] maintains its competitive performance thanks to LOCO-I, a simple yet efficient prediction method that uses a small causal neighborhood of three pixels to predict the current pixel. JPEG-LS is well known for its low complexity, which comes from simple residual-error modeling based on a Two-Sided Geometric Distribution (TSGD) and from the use of Golomb-like codes in the entropy coder.
The Context-based, Adaptive, Lossless Image Codec (CALIC) [7] is a more complex codec, representing the reference method in the literature for lossless encoding of continuous-tone images. In CALIC, the prediction is computed by the Gradient Adjusted Predictor (GAP), which uses a causal neighborhood of six pixels. Moreover, an error context modeling procedure exploits the higher-order structures, and an entropy coder based on histogram tail truncation efficiently compresses the residual-error.
In a more recent work [23], a lossless compression algorithm called Free Lossless Image Format (FLIF) was proposed based on Meta-Adaptive Near-zero Integer Arithmetic Coding (MANIAC), where not just the probability model associated to the local context is adaptive, but also the context model itself is adaptive. For any given image dataset, FLIF currently achieves the best compression results [24] compared to the most recent algorithms developed for lossless image compression applications.
Another domain where high spatial resolutions are encountered is light field imaging. In this domain, light field images acquired by plenoptic cameras [25] provide both spatial and angular information as 4D light field data. Consumer-level plenoptic cameras are built based on microlens technologies, leading to unfocused [26], [27] (e.g., Lytro cameras) or focused plenoptic cameras [28], [29] (e.g., Raytrix cameras). Microlens technologies enable capturing the light field as a so-called lenslet image, which is a matrix of macro-pixels, whereby each macro-pixel corresponds to a microlens covering N × N pixels in the camera sensor. The macro-pixels are arranged in the lenslet image according to the position of their corresponding microlens in the microlens matrix. An alternative approach for representing the 4D light field data is to generate the corresponding set of N^2 subaperture images from the acquired lenslet image. Each subaperture image then corresponds to a specific camera view captured at a specific angle, which is obtained by selecting the pixels located at the same spatial position in all macro-pixels.
In recent years, the research community has focused on offering solutions for compressing plenoptic images. Traditional methods have proven to be inefficient when applied to light field data as they fail to account for the specific macro-pixel structure of such images. In lossless compression, different methods were proposed by taking into account the plenoptic structure. In [30], the authors propose a predictive coding method for compressing the raw data captured by a plenoptic camera. In [31], each subaperture image in the RGB representation is encoded relative to a neighboring image based on a context modeling algorithm. In [32], a sparse modeling predictor guided by a disparity-based image segmentation is employed to encode the set of subaperture images after applying the RCT color transform from JPEG-LS [6], which resulted in an increased representation on 9 bits of the chroma components. In [33], different color transforms were tested for encoding the set of subaperture images. In the lossy compression domain, most of the proposed solutions are obtained by modifying the HEVC standard to take into account the plenoptic structure [34]–[37]. Furthermore, light field compression was the topic of several competitions or special sessions in the most important signal processing conferences [38], [39], where many approaches were proposed. The current state of the art in lossy coding of lenslet images has recently been proposed in [40]; in this approach, macro-pixels were adopted as elementary coding blocks, and dedicated intra-coding methods based on dictionary learning, directional prediction and optimized linear prediction ensure high coding efficiency for this type of data.
In recent years, the ML domain has gained a lot of popularity due to its high performance and applicability in numerous domains. In general, ML-based solutions are attractive since they address the modern high-dimensional challenges of processing large amounts of data, and they offer the possibility to simply replace specific components of a working algorithmic solution.
Furthermore, machine learning solutions have benefited from important recent breakthroughs that boosted their performance and enabled practical deployments in numerous domains; these advances include (i) the introduction of the batch normalization concept [41]; (ii) the study of weight initialization [42], [43]; (iii) activation functions [44], such as the Rectified Linear Unit (ReLU) [45], Leaky ReLU [46], etc., and
Fig. 3. (a) Residual Learning building block [48] (b) Inception layer [49].
Fig. 4. (a) Dense Block (DB) structure. (b) Convolutional Block (CB) structure. (c) Residual Learning based Block (ResLB) structure. (d) Inception and Residual Learning based Block (IResLB) structure.

Fig. 5. The REP-NN network proposed in [20] (leftmost) and the proposed network designs: ResLNN, IResLNN and IResLNN V.

• the Convolution Block (CB), as the block of layers containing one convolution layer, followed by one batch normalization layer and a ReLU layer, as depicted in Figure 4(b).
Moreover, we propose two new blocks of layers based on the BN concept and the two ML paradigms, each used as a base building block for the network designs proposed in this paper. The following types of building blocks are proposed:
(a) The ResL building block was modified to obtain the Residual Learning based Block (ResLB) with the structure of layers depicted in Figure 4(c). One may note that branch 1 in ResLB contains an extra 3 × 3 convolution layer compared to the ResL block so that the neural network can further process the residual. ResLB is used to build the Residual Learning-based Neural Network (ResLNN) depicted in Figure 5.
(b) The ResL and Inception concepts were combined to obtain the Inception and Residual Learning based Block (IResLB) with the structure of layers depicted in Figure 4(d). The main ideas used in designing IResLB are summarized as follows:
– the residual is processed as in ResLB by employing a 3 × 3 convolution layer in branch 1;
– in branch 2 the input feature map is processed by a 3 × 3 convolution layer, while in branch 3 it is processed by a 5 × 5 convolution layer;
– in branch 2 and branch 3 a preprocessing step, consisting of a 3 × 3 convolution layer, is introduced to reduce the number of parameters in the following convolution layer, having a halved number of filters;
– all the branches in the IResLB structure have the same output size and are added to obtain the output as in the ResL framework, whereas in the Inception layer a filter concatenation step is introduced instead.
IResLB is used to build the Inception and Residual Learning-based Neural Network (IResLNN) depicted in Figure 5.
Figure 5 depicts the structure of the proposed new network designs as well as our REP-NN network from [20]. The main idea in designing each proposed network was to first process the input image at the initial resolution with one CB block, followed by k = 5 blocks of ResLB or IResLB, and then to reduce the image resolution twice using a sequence of two ResLB blocks with stride 2. The rest of the model shares similarities with our layout in [20], where the final feature vector is processed with a sequence of 11 DB blocks. In the CNN-based architectures depicted in Figure 5, one may note that the role of the softmax activation function is to classify the input patch into one of the 256 classes set by the last dense layer, and that ε̂(x, y) is set as the index of the class with the highest probability.
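The block structure of Figure 4(d) can be illustrated with a minimal PyTorch sketch of an IResLB-style unit. It assumes that every convolution is wrapped in a CB unit (convolution, batch normalization, ReLU), that the three branches output the same number of channels so they can be summed, and that the halved filter count applies to the preprocessing convolutions; filter counts, the exact placement of the residual-branch convolution, and all names are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class CB(nn.Module):
    """Convolution Block: convolution -> batch normalization -> ReLU."""
    def __init__(self, in_ch, out_ch, kernel, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, stride=stride,
                              padding=kernel // 2)  # 'same' spatial size for stride 1
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

class IResLB(nn.Module):
    """Sketch of an Inception and Residual Learning based Block (IResLB).

    Branch 1 processes the residual path with a 3x3 convolution; branches 2
    and 3 first apply a preprocessing 3x3 convolution with a halved filter
    count and then a 3x3 or 5x5 convolution, respectively. All branches keep
    the same output size and are summed (ResL-style) instead of concatenated
    (Inception-style).
    """
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch1 = CB(channels, channels, 3)
        self.branch2 = nn.Sequential(CB(channels, half, 3), CB(half, channels, 3))
        self.branch3 = nn.Sequential(CB(channels, half, 3), CB(half, channels, 5))

    def forward(self, x):
        return self.branch1(x) + self.branch2(x) + self.branch3(x)
```

Stacking one initial CB, k = 5 such blocks, two stride-2 ResLB blocks and the final dense stages would approximate the overall IResLNN layout of Figure 5.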
In this paper, we set b = 15 to obtain the causal neighborhood Nb(x, y) with a resolution of 16 × 31. For these settings, the input image patches with a resolution of 16 × 31 are processed using N1 = 32 channels, the first reduced resolution of 8 × 16 is processed using N2 = 64 channels, while the final resolution of 4 × 8 is processed using N3 = 128 channels.
The tests have shown that the sequence of 11 DB blocks plays an important role in the reduction of network over-fitting. However, for the case of predicting video sequence frames, an improved performance is obtained by employing a network design where the DB blocks are removed and a GlobalMax layer is introduced instead, as depicted in Figure 5. This particular design was denoted IResLNN for video (IResLNN V).
The goal of the CNN-based predictor is to improve the prediction of the residual-errors ε(x, y). The CNN's input ε̄(x, y) is a 9-bit input in the range [−255, 255]. We reduce the dynamic range of ε̄(x, y) to an 8-bit representation via a clipping procedure, as follows:
(i) set to 127 all the errors larger than 127;
(ii) set to −128 all the errors smaller than −128; and
(iii) add 128 to shift the prediction range to [0, 255].
Additionally, we set a number of 256 output classes for the networks, that is, ε̂(x, y) will be represented on 8 bits. We note that the codec remains lossless, as the CNN's output ε̂(x, y) is further used to compute ε(x, y) based on equation (2), which is then encoded losslessly.
Note that the range of ε̄(x, y) was reduced because errors with large absolute values have a very low frequency, while the use of a large number of output classes in the dense layers would result in a large number of model parameters and high memory consumption.
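As a minimal illustration of this range-reduction step, the following NumPy sketch clips the 9-bit residual ε̄(x, y) and shifts it to [0, 255]; the function name and array handling are illustrative.

```python
import numpy as np

def clip_and_shift(eps_bar):
    """Map 9-bit residuals in [-255, 255] to 8-bit class labels in [0, 255].

    (i) errors larger than 127 are set to 127, (ii) errors smaller than -128
    are set to -128, and (iii) 128 is added to shift the range to [0, 255].
    """
    clipped = np.clip(eps_bar, -128, 127)
    return (clipped + 128).astype(np.uint8)

# Example: a residual of 200 saturates to 127 and becomes class 255.
labels = clip_and_shift(np.array([-300, -5, 0, 200]))  # -> [0, 123, 128, 255]
```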
The proposed network configurations were selected after performing complex testing procedures and are based on the following observations:
• the input image patches must be processed as much as possible at the initial resolution rather than at a lower resolution, with the drawback of increasing the memory consumption and the complexity of the network model;
• one CB block must be used in processing the input image patch before applying a ResLB or an IResLB block, as recommended in [49];
• the tests have shown that processing the feature map with a convolution layer with a window size larger than 5 × 5 does not improve the performance.
In all the convolution layers, the input is padded such that the activation map of each filter has the same size as the input, except in the case of the CB block with stride 2.
In Figure 5, one can notice that there is a large difference between the designs of the proposed neural networks and REP-NN, since REP-NN was developed as a sequence of CB blocks, while the proposed network designs process the input using a sequence of newly introduced building blocks of layers with 2 to 3 branches.

B. Error Context Modeling

The dual prediction method computes for each pixel position (x, y) the prediction error ε(x, y). In this paper, a complex error processing method is employed to process ε(x, y) and to obtain the coded error εc(x, y) by exploiting the higher-order dependencies between neighboring pixels. The proposed context modeling method is inspired by CALIC's modeling paradigm [7] and focuses on processing prediction errors of high resolution images and lenslet images.
The goal of the method is to generate a suitable number of contexts, without diluting them, and to model the residual error such that the entropy coder provides high coding efficiency when encoding εc(x, y).
Section III-B.1 describes the context modeling method employed for computing the context number assigned to each ε(x, y). Section III-B.2 describes the error modeling method applied to ε(x, y) to obtain εc(x, y).
1) Context Model: Given the current pixel, I(x, y), let us denote the neighboring pixels as: N = I(x − 1, y), W = I(x, y − 1), NW = I(x − 1, y − 1), NE = I(x − 1, y + 1), WW = I(x, y − 2), NN = I(x − 2, y), NNE = I(x − 2, y + 1). Moreover, let us denote the prediction value computed by GAP [7] as ICAL(x, y).
The method computes the current context based on two types of information: local texture information and local energy. The local texture information, denoted by B, is obtained in the form of local binary pattern information obtained by comparing ICAL(x, y) with the following vector of eight local pattern values C = {N, W, NW, NE, NN, WW, 2N − NN, 2W − WW}. Therefore, eight binary values are generated and B is computed as the 8-bit number formed by concatenating these binary values in the order given by C.
The local energy information is obtained by first computing the local energy and then quantizing it by employing the following procedure:
(1) evaluate the strength of the local horizontal edges, denoted by dh, and vertical edges, denoted by dv, as:

dh = |W − WW| + |N − NW| + |N − NE|,
dv = |W − NW| + |N − NN| + |NE − NNE|; (3)

(2) compute the error energy estimator, Δ, using the edge information and the neighboring prediction errors as follows:

Δ = dh + dv + ε(x − 1, y) + ε(x, y − 1); (4)

(3) quantize Δ using the set of quantizers Q = {5, 15, 25, 42, 60, 85, 140} to obtain a 3-bit value, denoted by Q(Δ).
In [7], the current context number is set as the 10-bit value obtained by setting B as its first 8 bits and Q(Δ)/2 as its last 2 bits. In this paper, the method from [7] was modified as follows: (i) the local texture information, B, is computed based on Î(x, y) instead of ICAL; (ii) the local energy information is computed as Q(Δ) instead of Q(Δ)/2; (iii) for lenslet images, a third component is introduced for computing the current context and it contains the subaperture information.
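A compact sketch of the texture and energy components described above is given below, assuming 8-bit neighbor values, the dual prediction Î(x, y) as input, the quantizer thresholds taken in ascending order, and the 11-bit context packing described in the next paragraph; variable names and the exact bit ordering of B are illustrative.

```python
def context_number(img, pred, eps, x, y):
    """Compute the texture pattern B, the quantized energy Q(delta), and the
    packed context index at position (x, y).

    img  : 2-D array of decoded pixel values (causal neighbors available)
    pred : dual prediction I_hat(x, y)
    eps  : 2-D array of prediction errors at causal positions
    """
    N, W = int(img[x - 1, y]), int(img[x, y - 1])
    NW, NE = int(img[x - 1, y - 1]), int(img[x - 1, y + 1])
    WW, NN, NNE = int(img[x, y - 2]), int(img[x - 2, y]), int(img[x - 2, y + 1])

    # Local texture: compare the prediction against the eight pattern values in C.
    C = [N, W, NW, NE, NN, WW, 2 * N - NN, 2 * W - WW]
    B = 0
    for v in C:
        B = (B << 1) | (1 if pred < v else 0)   # comparison direction is illustrative

    # Local energy: edge strengths plus the two previous prediction errors, eq. (4).
    dh = abs(W - WW) + abs(N - NW) + abs(N - NE)
    dv = abs(W - NW) + abs(N - NN) + abs(NE - NNE)
    delta = dh + dv + eps[x - 1, y] + eps[x, y - 1]

    # 3-bit quantization with the CALIC-style thresholds.
    thresholds = [5, 15, 25, 42, 60, 85, 140]
    q = sum(delta >= t for t in thresholds)     # value in 0..7

    # 11-bit context for high-resolution images: B in the high bits, Q(delta) low.
    return (B << 3) | q
```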
For high resolution images or video frames, the current context is set as the 11-bit value obtained by setting B as the first eight bits and Q(Δ) as the last three bits. For lenslet images, the current context is computed based on a third
Fig. 7. Context template showing the position of the causal neighbors used for creating the tree and the corresponding tree depth of the node.

In this paper, we used αh = 2 and αv = 4. Next, an intermediary prediction value, k̄(x, y), is computed as follows:

k̄(x, y) = k6 + 1,         if sh + sv = 0;
          (kv + k10)/4,    if sv > αv;
          (kh + k10)/4,    if sh > αh;
          k10,             otherwise. (5)
The final binary length prediction, k̂(x, y), is computed based on the observation that there is a higher chance that k̄(x, y) is an under-prediction (i.e., k̄(x, y) < k(x, y)) for the least significant bits of εc(x, y) than for the most significant bits. Therefore, k̂(x, y) is computed as follows:

k̂(x, y) = k̄(x, y) + δk(k̄(x, y)), (6)

where δk updates k̄(x, y) and is defined as follows:

δk(k̄(x, y)) = 2, if k̄(x, y) < 3; 1, otherwise. (7)

2) Context Tree Modeling: In this paper, the proposed codec utilizes the following set of nine binary context trees: Tξ encodes ξ(x, y), and Ti encodes bi, the i-th bit in the binary representation of εc(x, y) = Σ_{i=0}^{k̂(x,y)} bi · 2^i. Note that k̂(x, y) reduces the number of symbols encoded in the last bit-planes, since at most k̂(x, y) < 7 bit-planes are sufficient to represent εc(x, y).
Figure 7 depicts the context template utilized to generate each of the nine binary context trees. An index, dT, is assigned to each causal neighbor and represents the tree depth at which the current node of the context tree is extended based on the neighbor with index dT. The nodes in Tξ are set based on the values of ξ. The nodes in Ti are set as follows: the node at tree depth dT is set to 1 if the neighbor with index dT (see Figure 7) is represented using at least i bits, and to 0 otherwise. Each context tree is used by an adaptive context tree method [52] where the current symbol is encoded by the binary arithmetic codec corresponding to the context number computed using the context tree.
In this paper, we adopt the concept of halving the node's symbol counts every time the sum of symbol counts exceeds a halving threshold h1/2. The proposed method uses an aggressive strategy of halving the counts after h1/2 = 127 symbols.
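The adaptive per-context statistics with count halving can be sketched as a small count-based binary model; this is a generic illustration of the halving rule with h1/2 = 127, not the authors' exact implementation of the adaptive context tree method of [52].

```python
class BinaryContext:
    """Adaptive bit counts for one context, with periodic count halving."""

    def __init__(self, halving_threshold=127):
        self.counts = [1, 1]            # Laplace-style initialization for bits 0 and 1
        self.h = halving_threshold

    def probability_of_one(self):
        return self.counts[1] / (self.counts[0] + self.counts[1])

    def update(self, bit):
        self.counts[bit] += 1
        # Halve both counts once their sum exceeds the threshold, so the model
        # adapts quickly to the local statistics.
        if self.counts[0] + self.counts[1] > self.h:
            self.counts[0] = max(1, self.counts[0] // 2)
            self.counts[1] = max(1, self.counts[1] // 2)

# The arithmetic coder would query probability_of_one() for the context selected
# by the tree, code the bit, and then call update(bit).
```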
There are two strategies that can be used when generating the context tree. One simple strategy is to limit the maximum context tree depth dT until which the tree can grow. Based on this first strategy, the context is determined immediately for each symbol by passing only once through the whole image. A second strategy is to set a larger value for the maximum context tree depth, go through the whole image once and gather the counts for each possible node in the context tree, and finally prune the tree to obtain the optimal context tree. In this second strategy, although the context is determined only after a second pass through the image, it has the advantage of finding the optimal context tree for encoding the corresponding sequence of symbols. However, this always implies a trade-off between algorithmic complexity and algorithmic performance.
In this paper, the pruning process employs the Krichevsky-Trofimov estimator [53], based on a gamma function implementation, to compute the codelength estimate for encoding the sequence of symbols collected at each node. Based on this more complex method, the context is determined only the second time the current position is visited.
Both strategies are investigated and two profiles are proposed: (1) the FAST profile, where the 1-pass strategy is employed using a maximum tree depth dT^{1p}; (2) the SLOW profile, where the 2-pass strategy is employed using a maximum tree depth dT^{2p}.

Algorithm 1 Context-Based Bit-Plane Coding, the FAST Profile
1) Apply the dual prediction method from Section III-A.2 and compute Î(x, y) using equation (1).
2) Compute ε(x, y) using equation (2).
3) Compute εc(x, y) = Σ_{i=0}^{k(x,y)} bi · 2^i by employing the Context Modeling method described in Section III-B.
4) Compute k̂(x, y) using equation (6).
5) Set ξ(x, y) by comparing k̂(x, y) with k(x, y).
6) Encode ξ(x, y) as follows:
   a) Visit the nodes in Tξ from the root and up to dT^{1p} by using the neighbor corresponding to the index shown in Figure 7, and compute the current context number.
   b) Encode ξ(x, y) using the counts in the current context number.
   c) Update Tξ.
7) From i = k̂(x, y) down to i = 0, encode each bit bi as follows:
   a) Visit the nodes in Ti from the root and up to dT^{1p} by using the neighbor corresponding to the index shown in Figure 7, and compute the current context number.
   b) Encode bi using the counts in the current context number.
   c) Update Ti.

3) Algorithmic Details: The FAST profile of the proposed coding approach is summarized in Algorithm 1. The tests have shown that by setting a large tree depth the contexts are diluted, while by setting a small tree depth the number of contexts is too small to obtain a good performance.
The context trees generated for dT^{1p} = 12 obtain in general a good performance, and this value was selected for our tests; however, other values up to dT^{1p} = 30 can be used for different types of images.
The algorithmic description of the SLOW profile is summarized in Algorithm 2. The tests have shown that by setting dT^{2p} = 18 the proposed coding approach obtains a good performance in a reasonable runtime.
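The codelength estimate used for pruning in the SLOW profile can be sketched with the log-gamma function; the snippet below computes the Krichevsky-Trofimov codelength (in bits) of a node that has collected n0 zeros and n1 ones, which follows the standard closed-form expression, while the surrounding pruning logic (comparing a parent against the sum of its children) is only indicated in a comment.

```python
from math import lgamma, log, pi

def kt_codelength_bits(n0, n1):
    """Krichevsky-Trofimov codelength (in bits) for a binary sequence with
    n0 zeros and n1 ones, computed with log-gamma for numerical stability.
    """
    log_p = (lgamma(n0 + 0.5) + lgamma(n1 + 0.5)
             - lgamma(n0 + n1 + 1.0) - log(pi))
    return -log_p / log(2.0)

# Pruning idea: keep a node split only if the children's total estimated
# codelength is smaller than the parent's, e.g.
# if kt_codelength_bits(*left) + kt_codelength_bits(*right) < kt_codelength_bits(n0, n1): ...
```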
IV. EXPERIMENTAL EVALUATION

A. Experimental Setup

In this paper, the experimental validation is carried out on three different types of data: photographic images, lenslet images, and video frames. The following datasets are used:
(1) The dataset of 68 4K UHD grayscale images randomly selected from [54], with a resolution of 3840 × 2160.
(2) The EPFL Light Field dataset [55], available online [56], which contains 118 unfocused lenslet images captured by the Lytro camera in the RGB colormap representation. The resolution of the microlens matrix is 625 × 434 and the resolution of a macro-pixel is 15 × 15.
(3) The dataset of seven video sequences from the Ultra Video Group of the Tampere University of Technology, denoted here UVG-TUT, and available online [57]. The experimental testing is executed on the frame resolution of 1920 × 1080 and the compression results are reported only for the Y channel.
One may note that one grayscale matrix is encoded for the photographic image case, three color matrices (R, G, B) are encoded for the lenslet image case, and one luminance matrix is encoded for the video frame case. Hence, not only are three different types of data tested, but also three different types of image colormap representations. One may note that a neural network model must be trained for each type of data, for each color channel, and for each resolution.
The proposed deep-learning-based image codec is designed for lossless compression applications and its performance is assessed on still pictures, lenslet image data and video frames. One notes that the compression performance for the latter type of data can be improved by employing different inter-prediction techniques. Adapting the proposed codec to employ lossless inter-prediction is beyond the scope of this paper.
The proposed neural network models (ResLNN, IResLNN, and IResLNN V) were trained during 32 epochs and using a batch size of 4000 patches of size 16 × 31. A number of 10 million (10M) patches are randomly selected for each type of data from the selected training images. We remind that, in our work, we are using a 90%−10% ratio for splitting the 10M patches into training and validation data, and the learning rate is decreased progressively as follows. If we denote the learning rate at epoch i as ηi, then ηi+1 is set as ηi+1 = (fd)^(i/ns) · ηi, ∀i = 1, 2, . . . , 32, where fd = 0.2 is the decay rate, ns = 5 is the decay step, and η1 = 5 · 10^−4 is the learning rate at the first epoch.
latter type of data can be improved by employing different The above training procedure was proposed after testing
inter-prediction techniques. Adapting the proposed codec to the proposed method in a complex set of experimental setups
employ lossless inter-prediction is beyond the scope of this where different training parameter variations were studied.
paper. Figure 8 show relative compression results (see eq. (8) below)
The proposed neural network models (R ES LNN, for the set of 68 4K UHD images for the IResLNN predictor
IR ES LNN, and IR ES LNN V) were trained during 32 epochs when considering the following training parameter variations:
and using a batch size of 4000 patches of size 16 × 31. (a) slightly different IResLNN architectures (between 10 and
A number of 10 million (10M) patches are randomly selected 12 DB blocks); (b) different patch sizes (between 4 × 7 and
for each type of data from the selected training images. We 16 × 31 patch size); (c) different batch sizes (between 8
TABLE I
LOSSLESS COMPRESSION RESULTS FOR THE TEST SET (64 PHOTOGRAPHIC IMAGES)
TABLE II
LOSSLESS COMPRESSION RESULTS FOR THE LENSLET IMAGES FROM THE EPFL DATASET [56]
TABLE IV
BITRATE RESULTS AND IMPROVEMENT (%) COMPARED TO LOSSLESS HEVC INTRA FOR THE UVG-TUT DATASET [57]
• CBPNN has an improved performance of 13.7% over the MP-CNN predictor [21];
• CBPNN outperforms the JPEG-LS codec by 35.4%;
• CBPNN outperforms the CALIC codec by 31.3%;
• CBPNN outperforms the FLIF codec by 10.6%.
3) Results on Video Frames: The IResLNN V model was trained to predict the Y channel of the frames in the UVG-TUT dataset [57] with a video resolution of 1920 × 1080. The set of 10M patches was selected from the set of 15 training video sequences presented in Table III, available online [59]. An equal number of patches was allocated for each sequence and 4 frames were randomly selected from each sequence. Therefore, only 8.03% of the patches found in the training dataset were collected for training.
The video sequences used for training are completely different from the video sequences used for testing. They were acquired with a different generation of camera sensors and show a different type of content compared to the UVG-TUT dataset. The UVG-TUT dataset was captured using a camera sensor developed based on the latest technologies and it contains seven video sequences with a better video quality than the 15 training video sequences from [59].
The set of 10M patches was collected based on the idea that it must contain patches from all available video sequences, having the target resolution of the predicted frame. If available, we recommend the use of an even larger training set.
Note that to encode a video sequence having a different resolution than 1920 × 1080, one must train another IResLNN V model using a different set of 10M patches. The set must be collected from a different set of training video sequences than the one presented in Table III, where each video sequence was captured at the requested resolution.
For encoding video frames, the proposed method is called CBPNN V and it is based on the proposed coding approach where IResLNN V is employed for predicting the residual-error. CBPNN V was tested under both profiles: FAST and SLOW. The experimental evaluation compares the performance of the following methods:
(1) Lossless HEVC Intra [9] with the x265 implementation [60], configured to run in the lossless mode, veryslow preset, and using only intra prediction. The following parameters are passed:
--preset veryslow --keyint 1 --input-csp 0 --lossless --psnr
(2) the JPEG-LS codec [6];
(3) the FLIF codec [23];
(4) the CALIC codec [7];
(5) CBPNN V running under the FAST profile;
(6) CBPNN V running under the SLOW profile.
Table IV shows the compression results of these methods in bpp, and the improvement compared to Lossless HEVC Intra, denoted CR and computed for method MX as follows:

CR = 1 − BR_MX / BR_LosslessHEVCIntra. (10)

Note that the best and second best performances are marked in bold. One can notice that the proposed codec CBPNN V has an improved average performance compared to the state-of-the-art methods. Lossless HEVC Intra is outperformed by CBPNN V with the FAST profile by 19.82% and with the SLOW profile by 20.12%.

C. Complexity Analysis

The goal of this paper was to propose a new coding approach which employs deep learning-based prediction. The proposed neural network design was developed with the goal of obtaining improved compression results compared to the state-of-the-art algorithms.
In our experiments, to compute the pixel-wise prediction for one UHD grayscale image, with a 3840 × 2160 resolution, the neural network is inferred using a total of 8,294,400 patches. The current inference time on a machine equipped with an NVIDIA Titan X GPU is around 12 minutes, and depends on the available VRAM memory, the machine's RAM memory, the programming language and deep learning framework used, and on the software implementation. In this paper, for the set of 68 4K UHD images, the total inference runtime is around 14 hours. The runtime of the proposed CBP entropy codec is negligible compared to the inference time.
One may notice that a deep learning-based solution will always have a high runtime when compared to the state-of-the-art algorithms which were specially developed to have a low complexity. However, the runtime for the network inference can be reduced by using a smaller causal neighborhood or by applying specific methods for reducing the complexity of network inference. In recent years, the research community has offered different solutions, such as running a threshold-based algorithm by which the filter's weights are set to zero if they
are below a threshold, or by employing a method for network training which constrains the filter's weights to have a sparse representation, etc.
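As an illustration of the first idea, the snippet below zeroes convolution weights whose magnitude falls below a threshold; it is a generic PyTorch sketch with an illustrative threshold value, not the specific pruning scheme of any cited work.

```python
import torch

@torch.no_grad()
def threshold_prune(model, tau=1e-3):
    """Set to zero every convolution weight whose magnitude is below tau."""
    for module in model.modules():
        if isinstance(module, torch.nn.Conv2d):
            w = module.weight
            w[w.abs() < tau] = 0.0
```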
In our future work, we are planning to study how to reduce the complexity of network inference. We will study the network performance after applying small changes to the proposed design, and means to decrease the complexity without diminishing the coding performance.

V. CONCLUSIONS

The paper proposes a new coding approach for lossless image coding. The approach employs a deep learning-based predictor for computing the residual-error in a dual prediction method, and an entropy coder performing context-based bit-plane coding to encode the residuals. A new neural network design built on the ML concepts of the ResL framework and the Inception architecture was proposed, together with a new method for generating binary context trees. Moreover, a state-of-the-art error modeling method was proposed to encode high resolution images. The experimental validation is carried out on three different types of data: photographic images, lenslet images, and video sequences.
The experimental results show that the proposed approach systematically and substantially outperforms state-of-the-art methods for all the images and for all the types of data tested:
• For the photographic images, the JPEG-LS codec is outperformed on average by 59.3%, the CALIC codec is outperformed on average by 54.8%, and the FLIF codec is outperformed on average by 45.1%.
• For the lenslet images, the JPEG-LS codec is outperformed on average by 35.4%, the CALIC codec is outperformed on average by 31.3%, and the FLIF codec is outperformed on average by 10.6%.
• For the video frames, the HEVC standard is outperformed on average by 20.12% on the UVG-TUT dataset.
REFERENCES

[1] S. Chandra and W. W. Hsu, "Lossless medical image compression in a block-based storage system," in Proc. Data Compress. Conf., Snowbird, UT, USA, Mar. 2014, p. 400.
[2] L. F. R. Lucas, N. M. M. Rodrigues, L. A. da Silva Cruz, and S. M. M. de Faria, "Lossless compression of medical images using 3-D predictors," IEEE Trans. Med. Imag., vol. 36, no. 11, pp. 2250–2260, Nov. 2017.
[3] H. Wu, X. Sun, J. Yang, W. Zeng, and F. Wu, "Lossless compression of JPEG coded photo collections," IEEE Trans. Image Process., vol. 25, no. 6, pp. 2684–2696, Jun. 2016.
[4] V. Trivedi and H. Cheng, "Lossless compression of satellite image sets using spatial area overlap compensation," in Image Analysis and Recognition, M. Kamel and A. Campilho, Eds. Berlin, Germany: Springer, 2011, pp. 243–252.
[5] G. Yu, T. Vladimirova, and M. N. Sweeting, "Image compression systems on board satellites," Acta Astronautica, vol. 64, pp. 988–1005, May/Jun. 2009.
[6] M. J. Weinberger, G. Seroussi, and G. Sapiro, "The LOCO-I lossless image compression algorithm: Principles and standardization into JPEG-LS," IEEE Trans. Image Process., vol. 9, no. 8, pp. 1309–1324, Aug. 2000.
[7] X. Wu and N. Memon, "Context-based, adaptive, lossless image coding," IEEE Trans. Commun., vol. 45, no. 4, pp. 437–444, Apr. 1997.
[8] High efficiency video coding, International Organization for Standardization, document ISO/IEC 23008-2, ITU-T Rec. H.265, Dec. 2013.
[9] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
[10] M. Zhou, W. Gao, M. Jiang, and H. Yu, "HEVC lossless coding and improvements," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1839–1843, Dec. 2012.
[11] J. Y. Cheong and I. K. Park, "Deep CNN-based super-resolution using external and internal examples," IEEE Signal Process. Lett., vol. 24, no. 8, pp. 1252–1256, Aug. 2017.
[12] J. Xie, L. Xu, and E. Chen, "Image denoising and inpainting with deep neural networks," in Proc. Int. Conf. Neural Inf. Process. Syst., Lake Tahoe, NV, USA, vol. 1, 2012, pp. 341–349.
[13] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Proc. Adv. Neural Inf. Process. Syst., Montreal, QC, Canada, Dec. 2014, pp. 2366–2374. [Online]. Available: https://papers.nips.cc/paper/5539-depth-map-prediction-from-a-single-image-using-a-multi-scale-deep-network
[14] F. Liu, C. Shen, and G. Lin, "Deep convolutional neural fields for depth estimation from a single image," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, 2015, pp. 5162–5170. [Online]. Available: https://ieeexplore.ieee.org/document/7299152
[15] N. K. Kalantari, T.-C. Wang, and R. Ramamoorthi, "Learning-based view synthesis for light field cameras," ACM Trans. Graph., vol. 35, no. 6, 2016, Art. no. 193.
[16] G. Toderici et al., "Full resolution image compression with recurrent neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, Jul. 2017, pp. 5435–5443.
[17] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, "Variational image compression with a scale hyperprior," in Proc. Int. Conf. Learn. Represent. (ICLR), Vancouver, BC, Canada, 2018, pp. 1–23.
[18] J. Li, B. Li, J. Xu, R. Xiong, and W. Gao, "Fully connected network-based intra prediction for image coding," IEEE Trans. Image Process., vol. 27, no. 7, pp. 3236–3247, Jul. 2018.
[19] I. Schiopu, Y. Liu, and A. Munteanu, "CNN-based prediction for lossless coding of photographic images," in Proc. Picture Coding Symp. (PCS), San Francisco, CA, USA, Jun. 2018, pp. 16–20.
[20] I. Schiopu and A. Munteanu, "Residual-error prediction based on deep learning for lossless image compression," Electron. Lett., vol. 54, no. 17, pp. 1032–1034, Aug. 2018.
[21] I. Schiopu and A. Munteanu, "Macro-pixel prediction based on convolutional neural networks for lossless compression of light field images," in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Athens, Greece, Oct. 2018, pp. 445–449.
[22] Digital Compression and Coding of Continuous Tone Still Images—Requirements and Guidelines, Standard ITU Rec. T.81, ISO/IEC 10918-1, Sep. 1993.
[23] J. Sneyers and P. Wuille, "FLIF: Free lossless image format based on MANIAC compression," in Proc. IEEE Int. Conf. Image Process. (ICIP), Phoenix, AZ, USA, Sep. 2016, pp. 66–70.
[24] J. Sneyers and P. Wuille. FLIF Website. [Online]. Available: https://flif.info
[25] G. Lippmann, "Épreuves réversibles donnant la sensation du relief," J. Phys., vol. 7, no. 4, pp. 821–825, 1908.
[26] E. H. Adelson and J. Y. A. Wang, "Single lens stereo with a plenoptic camera," IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 2, pp. 99–106, Feb. 1992.
[27] R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz, and P. Hanrahan, "Light field photography with a hand-held plenoptic camera," Dept. Comput. Sci., Stanford Univ., Stanford, CA, USA, 2005, pp. 1–11.
[28] A. Lumsdaine and T. Georgiev, "The focused plenoptic camera," in Proc. IEEE Int. Conf. Comput. Photography, San Francisco, CA, USA, Apr. 2009, pp. 1–8.
[29] C. Perwaß and L. Wietzke, "Single lens 3D-camera with extended depth-of-field," Proc. SPIE, vol. 8291, pp. 829108-1–829108-15, Feb. 2012. [Online]. Available: https://www.spiedigitallibrary.org/conference-proceedings-of-spie/8291/829108/Single-lens-3D-camera-with-extended-depth-of-field/10.1117/12.909882
[30] C. Perra, "Lossless plenoptic image compression using adaptive block differential prediction," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Brisbane, QLD, Australia, Apr. 2015, pp. 1231–1234.
[31] I. Schiopu, M. Gabbouj, A. Gotchev, and M. M. Hannuksela, "Lossless compression of subaperture images using context modeling," in Proc. 3DTV Conf., True Vis.-Capture, Transmiss. Display 3D Video, Copenhagen, Denmark, Jun. 2017, pp. 1–4.
[32] P. Helin, P. Astola, B. Rao, and I. Tabus, "Minimum description length sparse modeling and region merging for lossless plenoptic image compression," IEEE J. Sel. Topics Signal Process., vol. 11, no. 7, pp. 1146–1161, Oct. 2017.
[33] J. M. Santos, P. A. A. Assuncao, L. A. da Silva Cruz, L. Távora, R. Fonseca-Pinto, and S. M. M. Faria, "Lossless light-field compression using reversible colour transformations," in Proc. 7th Int. Conf. Image Process. Theory, Tools Appl. (IPTA), Montreal, QC, Canada, Nov./Dec. 2017, pp. 1–6.
[34] C. Conti, J. Lino, P. Nunes, L. D. Soares, and P. L. Correia, "Improved spatial prediction for 3D holoscopic image and video coding," in Proc. 19th Eur. Signal Process. Conf., Barcelona, Spain, Aug./Sep. 2011, pp. 378–382.
[35] C. Perra and P. Assuncao, "High efficiency coding of light field images based on tiling and pseudo-temporal data arrangement," in Proc. IEEE Int. Conf. Multimedia Expo Workshops, Seattle, WA, USA, Jul. 2016, pp. 1–4.
[36] D. Liu, L. Wang, L. Li, Z. Xiong, F. Wu, and W. Zeng, "Pseudo-sequence-based light field image compression," in Proc. IEEE Int. Conf. Multimedia Expo Workshops, Seattle, WA, USA, Jul. 2016, pp. 1–4.
[37] L. Li, Z. Li, B. Li, D. Liu, and H. Li, "Pseudo sequence based 2-D hierarchical coding structure for light-field image compression," in Proc. Data Compress. Conf., Snowbird, UT, USA, Apr. 2017, pp. 131–140.
[38] T. Ebrahimi, P. Schelkens, and F. Pereira. ICME 2016 Grand Challenge: Light-Field Image Compression. Accessed: Mar. 1, 2017. [Online]. Available: https://mmspg.epfl.ch/meetings/page-71686-en-html/icme2016grandchallenge_1/
[39] T. Ebrahimi, F. Pereira, P. Schelkens, and S. Foessela, Grand Challenges: Light Field Image Coding. Accessed: Mar. 1, 2017. [Online]. Available: http://www.2017.ieeeicip.org/GrandChallenges.html
[40] R. Zhong, I. Schiopu, B. Cornelis, S.-P. Lu, J. Yuan, and A. Munteanu, "Dictionary learning-based, directional, and optimized prediction for lenslet image coding," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 4, pp. 1116–1129, Apr. 2019.
[41] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int. Conf. Mach. Learn., Lille, France, Feb. 2015, pp. 448–456. [Online]. Available: https://dl.acm.org/citation.cfm?id=3045118.3045167
[42] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. 13th Int. Conf. Artif. Intell. Statist., Sardinia, Italy, May 2010, pp. 249–256.
[43] K. He, X. Zhang, S. Ren, and J. Sun. (Feb. 2015). "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification." [Online]. Available: https://arxiv.org/abs/1502.01852
[44] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi, "Learning activation functions to improve deep neural networks," CoRR, vol. abs/1412.6830, pp. 1–9, Dec. 2014. [Online]. Available: https://arxiv.org/abs/1412.6830
[45] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. 27th Int. Conf. Mach. Learn. (ICML), Washington, DC, USA, 2010, pp. 807–814.
[46] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proc. Int. Conf. Mach. Learn. (ICML), Atlanta, GA, USA, 2013, pp. 1–3.
[47] D. P. Kingma and J. Ba. (Dec. 2014). "Adam: A method for stochastic optimization." [Online]. Available: https://arxiv.org/abs/1412.6980
[48] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, 2016, pp. 770–778. [Online]. Available: https://ieeexplore.ieee.org/document/7780459
[49] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, 2015, pp. 1–9. [Online]. Available: https://ieeexplore.ieee.org/document/7298594
[50] I. J. Goodfellow et al., "Generative adversarial networks," in Proc. Adv. Neural Inf. Process. Syst., Montreal, QC, Canada, Jun. 2014, pp. 2672–2680. [Online]. Available: https://papers.nips.cc/book/advances-in-neural-information-processing-systems-27-2014
[51] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. (Jun. 2016). "Conditional image generation with PixelCNN decoders." [Online]. Available: https://arxiv.org/abs/1606.05328
[52] I. Schiopu, "Depth-map image compression based on region and contour modeling," M.S. thesis, Tampere Univ. Technol., Tampere, Finland, Jan. 2016, vol. 1360. [Online]. Available: https://tutcris.tut.fi/portal/en/publications/depthmap-image-compression-based-on-region-and-contour-modeling(913b8afb-a1ad-44df-b399-61c7756d5ef9)/export.html
[53] R. Krichevsky and V. Trofimov, "The performance of universal encoding," IEEE Trans. Inf. Theory, vol. 27, no. 2, pp. 199–207, Mar. 1981.
[54] 4K UHD Photographic Images. Accessed: Aug. 25, 2017. [Online]. Available: http://www.ultrahdwallpapers.net/nature
[55] M. Rerabek and T. Ebrahimi, "New light field image dataset," in Proc. 8th Int. Conf. Qual. Multimedia Exper., Lisbon, Portugal, 2016, pp. 1–2.
[56] JPEG Pleno Database: EPFL Light-Field Data Set. Accessed: Mar. 1, 2017. [Online]. Available: https://jpeg.org/plenodb/lf/epfl
[57] Ultra Video Group, Tampere University of Technology. Test Sequences. Accessed: Jul. 1, 2018. [Online]. Available: http://ultravideo.cs.tut.fi/#testsequences
[58] Nvidia. Titan X Specifications. Accessed: Sep. 1, 2018. [Online]. Available: https://www.nvidia.com/en-us/geforce/products/10series/titan-x-pascal.
[59] Xiph.Org Foundation. Video Test Media. Accessed: Jul. 1, 2018. [Online]. Available: https://media.xiph.org/video/derf
[60] MulticoreWare. x265 Source Code, Version 2.7. Accessed: May 4, 2018. [Online]. Available: https://bitbucket.org/multicoreware/x265/downloads

Ionut Schiopu (M'13) received the B.Sc. degree in automatic control and computer science and the M.Sc. degree in advanced techniques in systems and signals from the Politehnica University of Bucharest, Romania, in 2009 and 2011, respectively, and the Ph.D. degree from the Tampere University of Technology (TUT), Finland, in 2016. From 2016 to 2017, he was a Post-Doctoral Researcher with TUT. Since 2017, he has been a Post-Doctoral Researcher with Vrije Universiteit Brussel, Belgium. His research interests are the design and optimization of machine learning tools for image and video coding applications, view synthesis, entropy coding based on context modeling, and image segmentation for coding.

Adrian Munteanu (M'07) received the M.Sc. degree in electronics and telecommunications from the Politehnica University of Bucharest, Romania, in 1994, the M.Sc. degree in biomedical engineering from the University of Patras, Greece, in 1996, and the Ph.D. degree (magna cum laude) in applied sciences from Vrije Universiteit Brussel, Belgium, in 2003. From 2004 to 2010, he was a Post-Doctoral Fellow with the Fund for Scientific Research Flanders (FWO), Belgium. Since 2007, he has been a Professor with the Department of Electronics and Informatics (ETRO), Vrije Universiteit Brussel (VUB), Belgium. He has authored more than 300 journal and conference publications, book chapters, and contributions to standards, and holds seven patents in image and video coding. His research interests include image, video, and 3D graphics coding, distributed visual processing, 3D graphics, error-resilient coding, multimedia transmission over networks, and statistical modeling. He was a recipient of the 2004 BARCO-FWO Prize for his Ph.D. work and several prizes and scientific awards in international journals and conferences. He served as an Associate Editor for the IEEE TRANSACTIONS ON MULTIMEDIA.