Video Compression by Neural Networks
Daniele Vigliano, Raffaele Parisi, Aurelio Uncini
Introduction
Image and video compression have been the object of intensive research in the last twenty years. The diffusion of a large number of compression algorithms led to the definition of several standards; two international organizations (ISO/IEC and ITU-T) have been heavily involved in the standardization of image, audio and visual data. For a complete overview of recent standards and recent trends in visual information compression see [45][51][52]; a detailed description of the standards addressed here is beyond the scope of this section, and more details can be found in the references.
The standards proposed for general-purpose still-image compression are JPEG [46][47], based on a block DCT transform followed by Huffman or arithmetic coding, and the more recent JPEG2000 [48][49][50], based on the discrete wavelet transform and EBCOT coding.
On the video compression side, hybrid schemes that reduce spatial redundancy by the DCT and temporal correlation by motion-compensated predictive coding are used in ITU H.261 [53]. It was designed and optimized for videoconference transmission over an ISDN channel (bit rates down to 64 kbit/s).
H.263 [56] and H.263+ [54] have the same core architecture as H.261, but some improvements are introduced, principally in the precision of motion compensation and in prediction; they allow the transmission of audio-video information at a very low bit rate (9.6 kbit/s).
The latest advances in video coding aim at collecting all the suitable features previously used in video compression to develop new standards (still under development) that outperform all those just introduced. One of these new algorithms is H.26L [77][55].
The first studies of the Moving Picture Experts Group (MPEG) started in 1988, aiming at the development of new standards for audio-video coding. The main difference with respect to the other standards is that the MPEG standards are "open standards", so they are not dedicated to a particular application.
MPEG-1 was developed to operate at bit rates of up to about 1.5 Mbit/s for consumer video coding and video content stored on media like CD-ROM and DAT; it provides important features including frame-based random access to video, fast forward/fast reverse (FF/FR) searches through compressed bit streams, reverse playback of video and editability of the compressed bit stream. MPEG-1 performs the compression using several algorithms, such as the subsampling of video information to match the HVS (human visual system), variable-length coding, motion compensation and the DCT to reduce the temporal and spatial redundancy [57][58][59].
MPEG-2 is similar to MPEG-1 but includes some extensions to cover a wider range of applications (e.g. HDTV and multichannel audio coding). It was designed to operate at bit rates between 1.5 and 35 Mbit/s. One of the main enhancements of MPEG-2 over MPEG-1 is the introduction of syntax for the efficient coding of interlaced video. Advanced Audio Coding (AAC) is one of the formats defined in the non-backward-compatible version of MPEG-2; it was developed to perform multichannel audio coding. MPEG-2 AAC is based on MPEG Layer III: some blocks are improved (frequency resolution, joint stereo coding, the Huffman coding) and others, like spectral and time prediction, were introduced. The resulting standard is able to code five audio channels [60][61].
From the evolution of object-oriented computer science comes the use of objects in video compression; this led to the development of MPEG-4: the video signal can be considered as composed of different objects, each with its own shape, motion and texture representation. Objects are coded independently in order to allow direct access and manipulation. The power of this coding approach is that different objects can be coded by different tools with different compression rates; in a video sequence some parts of the scene may tolerate less distortion than others. The original video is then divided into streams: audio and video streams are separated, and each object has its own stream, as does the information about object placement, scaling and motion (Binary Format for Scenes).
In MPEG-4, synthetic and natural sounds are coded in different ways: Synthetic Natural Hybrid Coding (SNHC) performs the composition of compressed natural audio and of synthetic sounds (artificial sounds are created in real time by the decoder). MPEG-4 also proposes the separation of speech from "non-speech" sound, because the former can be compressed by ad hoc techniques [62][64][63][65].
In recent years the value of information has started to depend not only on the information itself but on how easily one can access, manage, find and filter it. MPEG-7, formally named "Multimedia Content Description Interface", provides a rich set of tools for the description of audio-visual content in multimedia environments. The application areas that benefit from audio-video content description belong to different fields: from the web search of multimedia content to broadcast media selection, from cultural services (like art galleries) to home entertainment, from journalistic applications to the more general (multimedia) database applications [67][68][69][70]. The descriptions provided by MPEG-7 are independent of the compression method. Descriptions have to be meaningful only in the context of the considered application; for this reason different types of features provide different abstraction levels. The MPEG-7 standard consists of several parts; in this section the Multimedia Description Schemes, the visual description tools and the audio description tools are detailed.
Multimedia Description Schemes (DSs) are metadata structures for describing audio-visual content; they are defined by the Description Definition Language (DDL), based on XML. The resulting descriptions can be expressed in textual form (TeM) or in a binary compressed form (BiM); the first allows human reading and editing, while the second improves the efficiency of storage and transmission. In this framework, tools are developed that provide DSs with information about the content and the creation of the multimedia document, and DSs that improve the browsing of and access to the audio-visual content. The visual description tools perform the description of visual categories like colour, texture, motion, localization, shape and face recognition. The audio description tools contain low-level tools (e.g. spectral and temporal audio feature descriptions) and high-level specialized tools for musical instrument timbre, melody description, spoken content, and the recognition and indexing of general sound. The MPEG-7 standard also provides an application to represent the multimedia content description, named "Terminal"; it is important to underline that the Terminal performs both the downstream and the upstream transmission, involving more or less specific queries from the end user.
The MPEG standards introduced so far deal with the processing of multimedia content in a physical context and, with MPEG-7, in a semantic one; they do not address other issues like multimedia consumption, diffusion, copyright, or access and management rights. Based on the above observation, MPEG-21 aims at resolving this lack by providing new solutions for the access, consumption, delivery, management and protection processes of the different content types.
MPEG-21 is essentially based on two concepts: Digital Items and Users. The Digital Item (DI) represents the fundamental unit of distribution and transaction (e.g. a video collection or a musical album); it is modelled by the Digital Item Declaration (DID), a set of abstract terms and concepts. Users are all the entities (e.g. humans, communities, society) that interact with the MPEG-21 environment or use Digital Items. The management of Digital Items is governed by the Users' rights to perform the corresponding actions [71][72].
Vector quantization
Vector quantization (VQ) is a very popular and efficient method for frame-image (or still-image) compression; it represents the natural extension of scalar quantization to n-dimensional spaces [17][18][19].
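As an illustration of the basic mechanism, the following sketch (numpy; block dimensions, codebook size and data are toy values, not taken from the works cited here) encodes image blocks as indices of their nearest codewords and decodes them by table lookup.

```python
import numpy as np

def vq_encode(blocks, codebook):
    """Index of the nearest codeword for each vectorized block."""
    d = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def vq_decode(indices, codebook):
    """Reconstruction is a simple table lookup."""
    return codebook[indices]

# toy example: 4x4 blocks as 16-dim vectors, 32 codewords (5 bits per block)
rng = np.random.default_rng(0)
blocks, codebook = rng.random((100, 16)), rng.random((32, 16))
recon = vq_decode(vq_encode(blocks, codebook), codebook)
print("mean distortion:", ((blocks - recon) ** 2).mean())
```

The whole design problem then reduces to choosing a codebook that minimizes the average distortion for the source at hand.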
[Figure: vector quantization scheme; an input vector x_k is mapped to the index i_k of the nearest codeword in a codebook of N entries.]
The learning of the network is based on the evaluation of the distance between outputs and inputs: the winner is the neuron with the lowest value of that distance. The main advantages of using the SOFM with respect to other clustering algorithms (k-means, LBG) include less sensitivity to initialization, better rate-distortion performance and faster convergence; moreover, during learning the SOFM updates not only the winning class but also the neighboring ones, which counteracts the under-utilization of neurons that rarely win the competition.
For more details about the motivations that inspire the use of self-organizing feature maps in codebook design see [21][22][20]. Suitable properties of the SOFM can be exploited to perform more efficient codebook design; examples are APVQ (Adaptive Prediction VQ), FSVQ (Finite State VQ) and HVQ (Hierarchical VQ).
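A minimal sketch of SOFM-based codebook design follows (numpy; the 1-D topology and the learning-rate and neighborhood schedules are common textbook choices, not the specific settings of the cited works). Note the neighborhood update, which implements the remedy against neuron under-utilization mentioned above.

```python
import numpy as np

def train_sofm_codebook(data, n_codewords=64, epochs=20, lr0=0.5, seed=0):
    """1-D self-organizing feature map whose weight vectors are used as a
    VQ codebook. Besides the winner, neighboring codewords are also
    updated, which counteracts neuron under-utilization."""
    rng = np.random.default_rng(seed)
    codebook = data[rng.choice(len(data), n_codewords, replace=False)].copy()
    pos = np.arange(n_codewords)              # positions on the 1-D map
    n_steps, t = epochs * len(data), 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = 1 - t / n_steps
            lr = lr0 * frac                         # decaying learning rate
            radius = 1 + (n_codewords / 4) * frac   # shrinking neighborhood
            winner = np.argmin(((codebook - x) ** 2).sum(axis=1))
            h = np.exp(-(pos - winner) ** 2 / (2 * radius ** 2))
            codebook += lr * h[:, None] * (x - codebook)
            t += 1
    return codebook
```

The resulting codebook is also topologically ordered, the property exploited by APVQ below.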
APVQ uses ordered codebooks in which correlated inputs are quantized into adjacent codewords; an improvement in coding gain is obtained by encoding the codebook indices with a DPCM (or some other neural predictor) [23].
FSVQ [24][25] introduces some form of memory into static VQ. It defines states by using the previously encoded vectors, and in each state the encoder selects a subset of codewords of the global codebook; in the Side-Match FSVQ [29] the current state of the coder is given by the closest sides of the upper and left neighboring vectors (the blocks of the frame image).
In order to reduce the computational effort, hierarchical structures can be used. Techniques that cascade VQ encoders in several ways are widely diffused in the literature: two-layer structures or hierarchical structures [27] based on topological information [26].
Within the vector quantization framework, other neural applications are the two-step algorithms; in [28] an algorithm was proposed in which a neural PCA produces the inputs for a SOFM performing the VQ.
A different approach is based on singularity maps (SMs) [30][31][32]. The algorithm employs two types of singularity map: a hard one for daytime video sequences and a soft one for nighttime video sequences; moreover, it takes particular account of the presence of noise in the original video sequence, because noise makes the estimation of the singularity map more difficult.
The singularity map is obtained by labelling, with topological indices and greyscale correspondence, the singular points of the borders in the frame image. In this way the whole edge can be transmitted as a sequence instead of as an image.
The singularity map is the collection of the multiresolution edges of a frame image; the extraction process requires special care because ordinary edge extractors, like the Sobel operator, broaden the edge map.
For the hard singularity map, [31] proposes the use of an iterative min-max procedure; for the soft SM it proposes CNNs (Cellular Neural Networks), which can extract sharp edges in almost real time.
Once the SM has been computed, very low bit rate video compression is performed using EPWIC (Embedded Predictive Wavelet Image Coder [33]), EZW [34] or other high-performance wavelet compression techniques.
Motion compensation
Motion compensation (MC) is one of the most effective techniques for reducing the temporal correlation between adjacent frames. It is based on the observation that, in a large number of general-purpose video applications, adjacent frames can be very similar and thus highly correlated. In order to reduce this correlation, a block in a frame can be coded as a translated version of a block in a preceding frame, but the motion vector has to be transmitted too. In this framework only translational motion is considered.
In the motion estimation framework, frames are segmented into macroblocks of 16 x 16 pixels composed of 4 blocks of 8 x 8 pixels (finer blocks reduce the block representation error but produce computational overhead). Figure 3 shows how, in coding a block of frame k, the "best match block" of the previous frame is computed and then the representation error is coded together with the information of the "motion vector".
Several methods have been investigated in order to reduce the estimation error and to speed up the best-match search; the so-called predictive methods perform the matching search only towards previous frames, while the bidirectional ones also consider future frames to perform a bidirectional estimation.
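A minimal full-search block-matching sketch is given below (numpy; the 16-pixel block size matches the macroblocks above, while the search range p is an illustrative value). Only the previous frame is searched, as in the predictive methods; a bidirectional variant would repeat the search over a future frame.

```python
import numpy as np

def full_search(ref, cur, block=16, p=7):
    """Exhaustive block matching: for every block of the current frame,
    scan a (2p+1)x(2p+1) window of the reference (previous) frame and
    keep the displacement minimizing the sum of absolute differences.
    Frames are float 2-D arrays with dimensions multiples of `block`."""
    H, W = cur.shape
    mvs = np.zeros((H // block, W // block, 2), dtype=int)
    for bi in range(0, H, block):
        for bj in range(0, W, block):
            target = cur[bi:bi + block, bj:bj + block]
            best = np.inf
            for di in range(-p, p + 1):
                for dj in range(-p, p + 1):
                    i, j = bi + di, bj + dj
                    if i < 0 or j < 0 or i + block > H or j + block > W:
                        continue
                    sad = np.abs(ref[i:i + block, j:j + block] - target).sum()
                    if sad < best:
                        best = sad
                        mvs[bi // block, bj // block] = (di, dj)
    return mvs
```

The encoder then transmits, for each block, the motion vector and the coded residual between the block and its best match.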
[Figure: parallel neural motion estimation structure; each of the D blocks is processed by a unit Φ with inputs v_i, u_i and output Δv_i.]
These works aim at parallelizing the computational flow required by both motion estimation and compensation; the application of CNNs allows faster and scalable computation.
Figure 5 shows the cell of the network presented in [36]; it graphically represents the set of differential equations given by:

$C\,\dot{x}_{ij}(t) = -\dfrac{x_{ij}(t)}{R} + \sum_{k,l} A_{i,j;k,l}\, y_{kl}(t) + \sum_{k,l} B_{i,j;k,l}\, u_{kl}(t) + I$   (1.2)
Fig. 5. The CNN cell: the state x_ij is integrated from the initial condition x_ij(0); template A feeds back the outputs y, template B feeds the inputs u, and the output y_ij is obtained through the nonlinearity f(·) (cell parameters R, C, bias I).
In [38][39] CNNs perform fast and distributed operations on frame images. The mathematical formulation of the network used is the following:

$\dot{x}_{ij}(t) = -x_{ij}(t) + \sum_{k,l} A_{i,j;k,l}\, y_{kl} + \sum_{k,l} B_{i,j;k,l}\, u_{kl} + I$   (1.3)
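As a numerical illustration of dynamics of the form (1.2)-(1.3), the following sketch integrates a normalized CNN with forward-Euler steps (numpy; the templates A, B, the bias I and the step size are illustrative edge-extraction-like values, not those of [36][38][39]).

```python
import numpy as np

def conv3x3(img, tpl):
    """Apply a 3x3 CNN template with zero padding (numpy only)."""
    p = np.pad(img, 1)
    out = np.zeros_like(img)
    H, W = img.shape
    for di in range(3):
        for dj in range(3):
            out += tpl[di, dj] * p[di:di + H, dj:dj + W]
    return out

def cnn_step(x, u, A, B, I, dt=0.05):
    """One forward-Euler step of the normalized dynamics (1.3),
    with the usual piecewise-linear CNN output y = 0.5(|x+1| - |x-1|)."""
    y = 0.5 * (np.abs(x + 1) - np.abs(x - 1))
    return x + dt * (-x + conv3x3(y, A) + conv3x3(u, B) + I)

# illustrative edge-like templates (toy values, not those of the cited works)
A = np.zeros((3, 3)); A[1, 1] = 2.0
B = np.array([[-1., -1., -1.], [-1., 8., -1.], [-1., -1., -1.]])
u = np.where(np.random.default_rng(0).random((32, 32)) > 0.5, 1.0, -1.0)
x = np.zeros_like(u)
for _ in range(200):                 # iterate until the state settles
    x = cnn_step(x, u, A, B, I=-0.5)
```

Every cell evolves in parallel under the same local templates, which is what makes the CNN computation fast and scalable.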
The proposed motion compensation aims at determining, inside the frame I_{n+k}, which objects belong to the frame I_n. Considering the frame n+k, the positions of the objects in the previous frame n are computed by moving each object of frame n within a p x q pixel window and comparing the result with the frame I_{n+k}.
The motion search is performed following a spiral trajectory. All the processing operations needed to perform this search are carried out by the CNN by fixing suitable values of the network parameters A, B, Â, B̂, x, I, u, y.
− Layer 2: the fuzzification layer computes the degree of membership $o_{ij}^{(2)}$ of each input $x_i$ to the j-th rule through Gaussian functions with parameters $(m_{ij}, \sigma_{ij})$.
− Layer 3: the inference layer contains N neurons; the output of each neuron is $o_j^{(3)} = \prod_{i=1}^{3} o_{ij}^{(2)}$.
− Layer 4: the output layer contains only one neuron, which performs the centroid defuzzification; its output can be expressed as $o^{(4)} = \dfrac{\sum_{j=1}^{N} c_j\, o_j^{(3)}}{\sum_{j=1}^{N} o_j^{(3)}}$.
Fig. 6. The fuzzy neural network that performs the refinement of human objects.
The free parameters (m_ij, σ_ij, c_i) of the network are trained from foreground and background blocks; the training algorithm is a combination of an SVD-based least-squares estimator and the gradient descent method (hybrid learning).
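The forward pass of such a four-layer network can be sketched as follows (numpy; the Gaussian form of the membership functions is assumed from the free parameters (m_ij, σ_ij) listed above, and all numerical values are toy data).

```python
import numpy as np

def fuzzy_net_forward(x, m, s, c):
    """Forward pass of the four-layer fuzzy network of figure 6.
    x: (3,) block features;  m, s: (3, N) Gaussian membership
    parameters (m_ij, sigma_ij);  c: (N,) centroid parameters."""
    o2 = np.exp(-((x[:, None] - m) ** 2) / s ** 2)  # layer 2: memberships
    o3 = o2.prod(axis=0)                            # layer 3: product inference
    return (c * o3).sum() / o3.sum()                # layer 4: centroid defuzz.

# toy usage with N = 4 rules
rng = np.random.default_rng(0)
m, s, c = rng.random((3, 4)), 0.5 + rng.random((3, 4)), rng.random(4)
print(fuzzy_net_forward(np.array([0.2, 0.7, 0.4]), m, s, c))
```

In the hybrid learning scheme, the linear parameters c can be fitted by least squares for fixed memberships, while (m_ij, σ_ij) are adjusted by gradient descent.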
Other approaches to segmentation by fuzzy neural networks are based on the fuzzy clustering of more complex data structures; the data considered in performing the segmentation combine intra-frame information, such as colour, shape, texture and contour, with inter-frame information, such as motion and the temporal evolution of the object shape.
In [42] good segmentation results are obtained by a two-step decomposition. The first step performs the splitting of the image into subsets using an unsupervised neural network; the frame image is then divided into its clusters.
The hierarchical clustering phase reduces the complexity of the object structure; then a final processing step based on PCA (eigendecomposition) performs the refinement and provides the final foreground-background segmentation.
Other approaches are based on a subspace representation of the video sequence [43]. The algorithm describes a video sequence by the minimum set of maximally distant frames, selected on the basis of semantic content, that are still able to describe the sequence (the Key Frames); these frames are collected in a codebook. The core of the coding system is the definition of the Video Key Frames Codebook (VKC), based on the analysis of the video in a vector space. The creation is performed by an unsupervised neural network and consists in the storyboarding of the recorded sequence.
Image feature vectors are used to represent the images in the vector space; clustering all the images in the feature-vector space selects the smallest set of Video Key Frames used to define the VKC.
The following sections detail two waveform video compression algorithms; the proposed techniques are based on feedforward and locally recurrent neural networks. The technique of the first section was inspired by the generalization of a still-image compression approach [75]. In this context compression is achieved by finding a transform able to code images with a reduced number of parameters while still representing the original image with a satisfying quality level; this technique is well known as transform coding [51].
Given the set of coefficients representing a portion of an image or a video frame, transform coding produces a reduced set of coefficients such that the reconstruction has the minimum possible distortion. This reduction is possible because most of the energy of the original block is grouped into a reduced number of coefficients, which become representative of the whole block.
The optimum transform coder, in the mean-square-error sense, is the one that, for a fixed quantization, minimizes the mean-square distortion of the reconstructed data; the Karhunen-Loève transform satisfies this requirement.
In the framework of video compression this still-image compression technique is applied jointly with a temporal decomposition; the important issues are therefore the space-time decomposition and the information compression. The next sections are dedicated to the image-video preprocessing and to some ways of realizing the neural transform coder.
In every image one can distinguish uniform colour areas, with poor informative content, and areas with higher detail levels carrying much more information. Therefore, using different compression ratios on areas with different activity levels should provide better quality on detailed areas and higher compression ratios on areas with more uniform values.
Frame images are therefore decomposed into sub-blocks, which are processed instead of the whole image. Such blocks can be divided into subclasses and coded with different coders to improve the compression performance [14][16][15].
In several papers the block activity leads the coder to perform more or less compression; the idea consists in using different compression ratios on areas with different activity levels, in order to obtain good quality on blocks with many details and high compression ratios on blocks with uniform values.
Suitable results can be reached by dividing the higher-activity blocks on the basis of their orientation: horizontal, vertical or diagonal. The best performance is obtained with the classification proposed in [15], in which such blocks are grouped according to nine possible orientations: two horizontal (one darker on the left, one darker on the right), two vertical, four diagonal and, last, shaded.
Figure 7 shows a picture split into blocks of different sizes by means of a quad-tree approach based on the pixel variance measure: the bigger the dimension of the block, the lower the detail content, and vice versa.
Blocks with the same mask size are characterized by roughly the same amount of information and are grouped so as to be processed by the same neural network. Each neural network requires, in the learning phase, a training set specific to its group. In this way each neural network is specialized in treating a particular class of sub-images, all characterized by roughly the same activity.
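A minimal variance-driven quad-tree segmentation can be sketched as follows (numpy; the threshold th is a free parameter, and the 4-to-16 pixel size range mirrors the block sizes used in this section).

```python
import numpy as np

def quadtree_blocks(img, th, min_size=4, max_size=16):
    """Variance-driven quad-tree split: keep a block whole where the pixel
    variance is below th, otherwise split it into four sub-blocks, down
    to min_size. Returns (row, col, size) tuples; image dimensions are
    assumed to be multiples of max_size."""
    blocks = []

    def split(r, c, size):
        if size <= min_size or img[r:r + size, c:c + size].var() <= th:
            blocks.append((r, c, size))
        else:
            h = size // 2
            for dr in (0, h):
                for dc in (0, h):
                    split(r + dr, c + dc, h)

    H, W = img.shape
    for r in range(0, H, max_size):
        for c in range(0, W, max_size):
            split(r, c, max_size)
    return blocks

# blocks of equal size can then be routed to the network trained for that size
```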
[Figures 7 and 8: the image is split by the quad-tree and routed to the networks NN_1, ..., NN_k according to the threshold th; the video preprocessor evaluates the Depth Activity (DA) and the GOF preprocessor arranges each GOF as the frame sequence I, D2, D2, ..., D1.]
The images contained in every GOF (Group Of Frames) are coded by a set of trained neural structures that will be detailed in dedicated sections. The frame I and the last one inside the GOF, frame D1, are coded with a fitted quad-tree structure, as can be seen in figure 9. For each sub-block of the keyframe I and of the frame D1 (first and last frames of the GOF), in addition to the compressed data content, information has to be transmitted on the quad-tree segmentation, on the network to be used for the specific sub-block, on the coding of the sub-block mean value, on the quantization and on the number of frames internal to the GOF.
Sub-blocks of the D2 frames (the residual frames of the GOF) only require the information about the compressed data, because they have the same segmentation as D1.
Fig. 9. QT schemes applied to the pictures within the GOF
The advantage of using the D2 frames resides in the fact that frames near the D1 frame have, with a high confidence level, the same quad-tree segmentation structure, as in figure 9; moreover, such images consist largely of wide uniform colour areas, therefore the masks applied to them will mainly be constituted by large blocks (e.g. 16x16), reducing the bit rate.
Figure 10 presents a block scheme of the processing chain from the original video sequence to the compressed one.
Fig. 10. Block scheme of the neural quad-tree video coding.
Given the original video stream, the video preprocessor establishes the value of the Depth Activity (DA); the GOF preprocessor computes the differences between frames; the controller selects the keyframes and the frames D1 and D2, which have to be segmented in different ways; the coding block, after segmenting each frame into blocks, performs the neural coding of each group of input blocks.
Compression is obtained by discarding part of the eigenvectors, at the price of a representation error (loss of visual quality) that depends on the eigenvalues associated with the discarded eigenvectors.

In detail, given an N-dimensional vector signal x (an n × n pixel image, with N = n²), the Karhunen-Loève transform represents it in the orthogonal space of the eigenvectors of its autocovariance matrix (a square N × N matrix); taking W as the N × N change-of-basis matrix, no compression is performed [21]:

$y = Wx$   (1.4)
The vector ŷ represents the projection of the vector signal x onto a subspace spanned by a reduced number of eigenvectors (M < N). The representation error is not larger than the sum of the eigenvalues corresponding to the eigenvectors not chosen. Considering a change-of-basis matrix W with rows ordered in such a way as to minimize the reconstruction error, the resulting vector ŷ provides the best approximation of x for the given subspace dimension [1]:

$\hat{y} = \hat{W}x = \sum_{i=1}^{M} \mathbf{w}_i x, \qquad M < N$   (1.5)

where the $\mathbf{w}_i$ are the first M rows of W.
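The following sketch estimates the KLT basis of (1.4)-(1.5) from sample blocks by eigendecomposition of the sample autocovariance matrix (numpy; block and subspace dimensions are toy values).

```python
import numpy as np

def klt_basis(blocks, M):
    """Rows of the returned M x N matrix are the eigenvectors of the
    sample autocovariance with the M largest eigenvalues (W_hat in (1.5))."""
    X = blocks - blocks.mean(axis=0)
    C = X.T @ X / len(X)                       # N x N autocovariance estimate
    eigval, eigvec = np.linalg.eigh(C)         # ascending eigenvalues
    return eigvec[:, np.argsort(eigval)[::-1][:M]].T

rng = np.random.default_rng(0)
blocks = rng.random((500, 64))                 # e.g. 8x8 blocks as 64-dim vectors
mean = blocks.mean(axis=0)
W_hat = klt_basis(blocks, M=8)
y_hat = (blocks - mean) @ W_hat.T              # reduced coefficients, as in (1.5)
x_rec = y_hat @ W_hat + mean                   # best rank-M reconstruction
print("MSE:", ((blocks - x_rec) ** 2).mean())
```

The neural approaches below compute the same principal directions adaptively, without forming the autocovariance matrix explicitly.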
[Figure: a single linear neuron with inputs x_1, ..., x_N and weights a_1, ..., a_N computing a principal component y; a linear network projecting x onto the first M < N components to give ŷ.]
The Generalized Hebbian Algorithm includes the orthogonalization within the learning rule, as shown in the following equation:

$A[n+1] = A[n] + \mu \left( y x^T - \mathrm{LT}\!\left[ y y^T \right] A[n] \right)$   (1.8)

where LT[·] is the operator that sets to zero all the elements above the diagonal. The generalized Hebbian learning provides the matrix A with the first M principal directions. An alternative approach for extracting the first M principal components is the APEX network.
Also this architecture has a biological justification. The m-th principal component can be computed on the basis of the previous m−1; the c components are called anti-Hebbian. An exhaustive explanation of Hebbian learning and of the algorithms it inspired is beyond the scope of this section; for the learning rule of the APEX network see [74].
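A minimal implementation of the GHA update (1.8) is sketched below (numpy; the learning rate and number of epochs are illustrative choices). The operator LT[·] corresponds to taking the lower triangle of y yᵀ.

```python
import numpy as np

def gha(X, M, mu=1e-3, epochs=10, seed=0):
    """Generalized Hebbian Algorithm, eq. (1.8):
    A <- A + mu * (y x^T - LT[y y^T] A), with LT[.] zeroing the
    elements above the diagonal (np.tril keeps the rest).
    The rows of A converge to the first M principal directions."""
    rng = np.random.default_rng(seed)
    A = 0.01 * rng.standard_normal((M, X.shape[1]))
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            y = A @ x
            A += mu * (np.outer(y, x) - np.tril(np.outer(y, y)) @ A)
    return A

# toy usage: zero-mean data with a few dominant directions
X = np.random.default_rng(1).standard_normal((1000, 8)) * \
    np.array([3, 2, 1, .5, .3, .2, .1, .1])
A = gha(X - X.mean(axis=0), M=3)
```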
An autoassociative network of fixed size codes the uniform blocks with a quite low distortion level, but the less uniform ones with higher distortion.
Fig. 14. Multilayer perceptron used for autoassociative backpropagation.
To overcome this problem, Size Adaptive Networks [6] use different trained networks to perform a compression that strongly depends on the block activity. This allows higher compression of the blocks with a low activity level and a good detail recovery of the blocks with a higher one.
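A minimal autoassociative two-layer perceptron in the spirit of figure 14 can be sketched as follows (numpy, plain batch backpropagation with a fixed sigmoid; the adaptive spline activations of [8] and the 4-bit quantization of the outputs are omitted).

```python
import numpy as np

def train_autoassociative(blocks, n_hidden, lr=0.5, epochs=500, seed=0):
    """N -> n_hidden -> N autoassociative perceptron trained by batch
    backpropagation to reproduce its input; the n_hidden sigmoid
    activations are the compressed representation of each block."""
    rng = np.random.default_rng(seed)
    N = blocks.shape[1]
    W1 = 0.1 * rng.standard_normal((N, n_hidden))
    W2 = 0.1 * rng.standard_normal((n_hidden, N))
    for _ in range(epochs):
        h = 1 / (1 + np.exp(-(blocks @ W1)))   # encoder (sigmoid hidden layer)
        out = h @ W2                           # linear decoder
        err = out - blocks                     # reconstruction error
        gh = err @ W2.T * h * (1 - h)          # backpropagated hidden gradient
        W2 -= lr * h.T @ err / len(blocks)
        W1 -= lr * blocks.T @ gh / len(blocks)
    return W1, W2

# one such network per block size/activity class realizes the size-adaptive idea
```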
As introduced in the dedicated section, the quad-tree algorithm segments images into blocks of several dimensions on the basis of the activity level. In this way images are segmented into blocks of 4 x 4, 8 x 8 and 16 x 16 pixels, as shown in figure 15.
[Figure 15: the quad-tree segmentation routes 16x16, 8x8 and 4x4 blocks to the networks NN 16x16, NN 8x8 and NN 4x4, whose outputs feed the neural decoder.]
The input and output layers have as many neurons as there are pixels in the block. The output of each neuron is quantized with 4 bits.
It should be noted that learning is an important issue in this kind of application. In order to improve the learning capability, advances in adaptive sigmoidal functions have been developed; in these neural networks spline adaptive models are used instead of fixed sigmoidal functions [8].
The performance of video compression is usually evaluated on the basis of the Peak Signal-to-Noise Ratio (1.11), calculated from the MSE in dB, and of the bit rate (in kbit/s):

$\mathrm{PSNR} = 10 \log_{10} \dfrac{256^2}{\frac{1}{M \times N} \sum_{m=1}^{M} \sum_{n=1}^{N} \left[ pix_{org}(m,n) - pix_{comp}(m,n) \right]^2}$   (1.11)
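In code, (1.11) reads as follows (numpy; the peak value 256² follows the text).

```python
import numpy as np

def psnr(pix_org, pix_comp):
    """PSNR of eq. (1.11): 10 log10(256^2 / MSE), frames as 2-D arrays."""
    mse = ((pix_org.astype(float) - pix_comp.astype(float)) ** 2).mean()
    return 10 * np.log10(256.0 ** 2 / mse)
```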
Figure 16 shows several PSNR values obtained by compressing the Missa benchmark, processing the GOFs with different threshold levels.
Fig. 16. a) The Missa avi movie, segmented and compressed; b-c) PSNR along the GOF evolution for two different threshold levels.
Table 1 shows the PSNR and Br values of the Missa video as the threshold changes. From table 1 it is easy to see that a more relaxed threshold (a higher value) produces a gain in compression level (a lower bit rate) but diminishes the quality of the recovered video.
Table 1. Peak Signal-to-Noise Ratio (PSNR) and bit rate (Br) of the Missa video on changing the threshold.

              th = 8                 th = 15                th = 30
        PSNR (dB)  Br (kbps)   PSNR (dB)  Br (kbps)   PSNR (dB)  Br (kbps)
Missa     34.62     205.63       34.02     166.05       33.02     152.85
[Figure 17: the hierarchical neural network; the M input blocks feed the combiner layer, followed by the compressor (with quantizer Q) and the decombiner layer, each with N_h hidden neurons per section.]
The image is divided into N sections, each of which is segmented into n × n pixel blocks; the groups of blocks are processed together by the hierarchical structure represented in figure 17.
The hierarchical neural network is not fully connected; it consists of input, hidden and output layers. The input and output layers are single layers composed of N blocks (one for each section of the image), where each block has n² neurons; the hidden section consists of three layers: the combiner, compressor and decombiner layers. The combiner layer is not fully connected with the input one.
Although the learning of this structure could be performed by classical backpropagation, the so-called Nested Training Algorithm (NTA) provides better performance. The NTA is a three-phase training, one phase for each part of the structure:
− OLNN (Outer Loop Neural Network): trains the fully connected network constructed by the input layer, the combiner layer and the output layer; standard backpropagation is applied to the structure, with the desired output equal to the input; the training set is given by the segmented image blocks.
− ILNN (Inner Loop Neural Network): trains the hidden fully connected layers: combiner, compressor and decombiner.
− After the OLNN and the ILNN are trained, their weights are used to construct the overall network.
It should be noted that this hierarchical structure performs inter-block decorrelation in order to achieve a better compression level. The use of a two-layer perceptron (a non-hierarchical structure) with spline adaptive activation functions reaches the same performance in terms of image quality and compression level while requiring a simpler structure.
In both approaches an input x[n−h] at time lag n−h may influence the output y[n] at time lag n. In the case of asymptotic stability, the derivative ∂y[n]/∂x[n−h] goes to zero as h goes to infinity. The value of h for which the derivative becomes negligible is called the temporal depth, whereas the number of adaptable parameters divided by the temporal depth is called the temporal resolution.
The architecture used in this context is the IIR-MLP proposed by Back and Tsoi [10][11], where the static synapses are substituted by conventional IIR adaptive filters, as depicted in figure 18.
[Figure 18: the IIR-MLP neuron; each input x_j^(l−1)[n] passes through an IIR synapse, and the summed synapse outputs feed the sigmoid activation.]
Several algorithms exist in the literature for training such networks, although a comprehensive framework is still missing. In [9] a well-performing algorithm was introduced for the learning of the so-called locally recurrent neural networks. It is a gradient rule based on the recursive backpropagation algorithm.
The learning of the locally recurrent neural network is performed by a new gradient-based on-line algorithm [9], called causal recursive backpropagation (CRBP); it presents some advantages with respect to the already known on-line training methods and to the well-known recursive backpropagation. The CRBP algorithm includes standard backpropagation as a particular case [12][13].
The locally recurrent neural network is designed by introducing an ARMA model in place of the linear synapses; figure 19 shows the structure of the network. The forward phase at time lag n is described by the following equations, evaluated for the layers l = 1, ..., M and the neurons k = 1, ..., N_l:

$y_{km}^{(l)}[n] = \sum_{p=0}^{L_{km}^{(l)}-1} w_{km}^{(l)}(p)\, x_m^{(l-1)}[n-p] + \sum_{p=1}^{I_{km}^{(l)}} v_{km}^{(l)}(p)\, y_{km}^{(l)}[n-p]$   (1.12)

$x_k^{(l)}[n] = \mathrm{sgm}\!\left( \sum_{m=0}^{N_{l-1}} y_{km}^{(l)}[n] \right)$   (1.13)
Given Φ^(l)[n], the set of weights of layer l at time lag n, the updating rule is:

$\Phi^{(l)}[n+1] = \Phi^{(l)}[n] + \Delta\Phi^{(l)}[n+1-D_l]$   (1.14)

in which:

$D_l = \begin{cases} 0 & \text{if } l = M \\ \sum_{i=l+1}^{M} \max_{n,m} \left( L_{nm}^{(i)} - 1 \right) & \text{if } 1 \le l \le M-1 \end{cases}$   (1.15)

with $(L_{nm}^{(i)} - 1)$ the order of the moving-average part of the synapse connecting the n-th neuron of the i-th layer to the m-th output of the (i−1)-th layer.
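The forward phase (1.12)-(1.13) can be sketched as follows (numpy; a single neuron is shown, and the coefficient vectors w and v are toy values standing in for trained weights).

```python
import numpy as np

def iir_synapse(x, w, v):
    """ARMA synapse of eq. (1.12):
    y[n] = sum_{p=0..L-1} w[p] x[n-p] + sum_{p=1..I} v[p-1] y[n-p]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        ma = sum(w[p] * x[n - p] for p in range(len(w)) if n - p >= 0)
        ar = sum(v[p - 1] * y[n - p] for p in range(1, len(v) + 1) if n - p >= 0)
        y[n] = ma + ar
    return y

def iir_neuron(inputs, synapses):
    """Neuron output of eq. (1.13): sigmoid of the summed synapse outputs.
    inputs: list of input sequences; synapses: list of (w, v) pairs."""
    s = sum(iir_synapse(x, w, v) for x, (w, v) in zip(inputs, synapses))
    return 1 / (1 + np.exp(-s))

# toy usage: two input sequences, two ARMA synapses (stable: small |v|)
t = np.arange(50)
out = iir_neuron([np.sin(0.2 * t), np.cos(0.1 * t)],
                 [(np.array([0.5, 0.3]), np.array([0.2])),
                  (np.array([0.4]), np.array([0.1, 0.05]))])
```

The recursive (AR) part gives each synapse an infinite memory of past inputs, which is what the update delays D_l of (1.14)-(1.15) account for during training.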
[Figure 19: example of the IIR-MLP structure; the synapses are ARMA filters with moving-average weights w_{km}^{(l)}(p), recursive weights v_{km}^{(l)}(p) and unit delays q^{-1}, followed by the sigmoid activations sgm(·).]
The CRBP algorithm is computationally simple, and its application to video compression produces suitable performance. The proposed architecture is applied as the neural coder in the coding block of the architecture shown in figure 10; according to the general architecture, this neural coder receives and compresses the image blocks produced by the quad-tree segmentation.
The learning of a locally recurrent neural network for video compression is a delicate issue, due to the fact that recurrent networks are sensitive to a large number of factors, for instance the type of video content in the training set, the video length, or the way in which the examples are presented.
Such sensitivity can compromise the correct learning of the network, altering the end results, e.g. producing artifacts on the restored video. The most common artifacts are the so-called "regularities" and the so-called "memory effect"; both of them can be avoided by taking special care in training and in designing the structure.
An example of "regularities" is shown in figure 20; they can be avoided by reducing the length of the video training set.
Fig. 20. Regularity effects (inside the boxes) in two frames of a video sequence.
In this context a considerable advantage can be obtained by using locally recurrent neurons only in the second layer of the structure of figure 14. It can be observed, also from the examples of figures 20-21, that most of the artifacts appear in the "background" of the scene: recurrent neural networks reach a good learning level in the more dynamic parts of the scene, but not in the static background sections.
In order to overcome this problem, after the segmentation of the scene a hybrid approach can be used to perform the training of the network. Since static neural networks perform better on the more static subscenes, they can be used to compress the 16 × 16 blocks, the ones with the lowest activity; recurrent neural networks can be used to code the 4 × 4 and 8 × 8 blocks with a high detail level.
This approach applies different processing to the lower- and the higher-activity blocks, differing not only in the network size but also in structure and learning. The performance obtained with the hybrid approach is collected in table 2.
Table 2. Mean bit rate and peak signal-to-noise ratio reached with three different kinds of neural networks, all of them IIR-MLPs with dynamic synapses.
Fig. 22. Frames of the Suzi video compressed and recovered with the Suzi_02 network (showing no block effect) and with the Suzi_04 network (showing block effect).
References
[1] Jiang J (1999) Image compression with neural network – A survey. In: Signal Process-
ing and image communications, vol 14, 1999, pp 737-760.
[2] Hebb D O (1949) The organization of behaviour. New York, Wiley, 1949.
[3] Dony R D, Haykin S (1995) Neural network approaches to image compression. Proc.
IEEE 83, vol 2, February 1995, pp 288-303.
[4] Kohno R, Arai M, Imai H (1990), Image compression using neural network with learn-
ing capability of variable function of a neural unit. In: SPIE vol 1360, Visual Commu-
nication and Image processing ‘90, pp 69-75, 1990.
[5] Cottrell G W, Munro P, Zipser D (1988), Image compression by back propagation and examples of extensional programming. In: Sharkey N E (Ed.) Advances in cognitive science (Ablex, Norwood, NJ, 1988).
[6] Parodi G, Passaggio F (1994), Size-Adaptive Neural Network for Image Compression.
International Conference on Image Processing, ICIP ’94, Austin, TX, USA.
[7] Namphon A, Chin S H, Azrozullah M (1996), Image compression with a Hierarchical
Neural Network, IEEE Transaction on Aereospace and electronic System, vol 32, No.1
January 1996.
[8] Guarnieri S, Piazza F, Uncini A (1999) Multilayer Feedforward Networks with Adap-
tive Spline Activation Function, IEEE Trans. On Neural Network, vol 10, No. 3, pp.
672-683.
[9] Campolucci P, Uncini A, Piazza F, Rao B D (1999), On-Line Learning Algorithms for
Locally Recurrent Neural Networks. IEEE Trans. on Neural Network, vol 10, No. 2,
pp 253-271 March 1999.
[10] Back A D, Tsoi A C (1991) FIR and IIR synapses, a new neural network architecture
for time series modelling. Neural Computation, vol 3, pp. 375-385.
[11] Back A D, Tsoi A C (1994) Locally recurrent globally feedforward networks: a critical
review of architectures, IEEE Trans. Neural Networks, vol 5, pp 229-239.
[12] Rumelhart D E, Hinton G E, Williams R J (1986) Learning internal representations by er-
ror propagation, Parallel Distributed Processing: Explorations in the Microstructure of
Cognition, vol 1, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group,
Eds. Cambridge, MA: MIT Press.
[13] Widrow B, Lehr M A, (Sept 1990) 30 years of adaptive neural networks: perceptron,
madaline and backpropagation, Proc. IEEE, vol 78, pp 1415-1442.
[14] Cramer C (1998) Neural Network for image and video compression: A review. Euro-
pean Journal of Operational research pp 266-282.
[15] Marsi S, Ramponi G, Sicuranza L (1991) Improved neural structure for image com-
pression. In: Proceedings of the international conference on acoustic speech and signal
processing Toronto, Ont., IEEE Piscataway, NJ, 1991 pp.2821-2824.
[16] Zheng Z, Nakajiama M, Agui T (1992) Study on image data compression by using
neural network. In: Visual communication and image processing’92, SPIE 1992, pp
1425-1433.
[17] Gray R M (1984) Vector quantization. In: IEEE Acoustic and Speech Signal Process-
ing. Apr. 1984, pp 4-29.
[18] Goldberg M, Boucher P R, Shliner S (1988) Image compression using adaptive vector
quantization. In: IEEE Trans. Communication, vol 36, 1988, pp 957-971.
[19] Nasrabadi N M, King R A (1988) Image coding using vector quantization: A review.
In: IEEE Transaction on communication, vol 36, 1988, pp 957-971.
[20] Nasrabadi N M, Feng Y (1988) Vector quantization of images based upon the Kohonen self-organizing feature map. In: IEEE Proceedings of the International Conference on Neural Networks, San Diego, CA, 1988, pp 101-108.
[21] Haykin S (1998) Neural Networks: A Comprehensive Foundation. In: Prentice Hall,
06 July, 1998
[22] Kohonen T (1990) The self organizing map. In: Proc. IEEE, vol 78, pp. 1464-1480,
Sept 1990.
[23] Poggi G, Sasso E (1993) Codebook ordering technique for address predictive VQ. In:
Proc. IEEE Int. Conf. Acoustic and Speech and Signal Processing ’93, pp. V 586-589,
Minneapolis, MN Apr. 1993.
[24] Liu H, Yum D J J (1993) Self-organizing finite state vector quantization for image cod-
ing. In: Proc. of international Workshop on Application of neural networks in tele-
communications, Hillsdale, NJ: Lawrence Erlbrume Assoc., 1993.
[25] Foster J, Gray R M, Dunham M O (1985) Finite state vector quantization for wave-
form coding. In: IEEE transaction on information Theory, vol 31, 1985, pp 348-359.
[26] Luttrell S P (1989) Hierarchical vector quantization. In: IEE Proc. (London), vol 136
(Part I), pp 405-413, 1989
[27] Li J, Manikopoulos C N (1989) Multi-stage vector quantization based on self organizing
feature map. In: SPIE vol 1199, visual Communic and Image Processing IV (1989),
pp. 1046-1055.
[28] Weingessel A, Bischof H, Hornik K, Leisch F (1997) Adaptive Combination of PCA and
VQ neural networks. In: Letters on IEEE Transaction on Neural Network, vol.8 no. 5,
Sept 1997.
[29] Huang Y L, Chang R F (2002) A new Side-Match Finite State Vector Quantization Us-
ing Neural Network for image coding. In: Journal of visual Communication and image
representation, vol 13, pp 335-347.
[30] Noel S, Szu H, Tzeng N F, Chu C H H, Tanchatchawal S (1999) Video Compression
with Embedded Wavelet Coding and Singularity Maps. In: 13th Annual International
Symposium on Aerospace/Defense Sensing, Simulation, and Controls, Orlando, Flor-
ida, April 1999.
[31] Szu H, Wang H, Chanyagorn P (2000) Human visual system singularity map analyses.
In: Proc. of SPIE: Wavelet Applications VII, vol 4056, pp 525-538, Apr. 26-28, 2000.
[32] Hsu C, Szu H (May 2002) Video Compression by Means of Singularity Maps of Hu-
man Vision System. In: Proceedings of World Congress of Computational Intelligence,
May 2002, Hawaii, USA.
[33] Buccigrossi R, Simoncelli E (Dec. 1999) Image Compression via Joint Statistical
Characterization in the Wavelet Domain. In: IEEE Trans. Image Processing, vol 8, no
12, pp 1688-1700, Dec. 1999.
[34] Shapiro J M (1993) Embedded Image Coding Using Zerotrees of Wavelet Coeffi-
cients. In: IEEE Trans. Signal Processing, vol. 41, no. 12, pp 3445-3462, Dec. 1993.
[35] Skrzypkowiak S S, Jain V K (2001) Hierarchical video motion estimation using a neu-
ral network. In: Proceedings, Second International Workshop on Digital and Computa-
tional Video 2001, 8-9 Feb. 2001 pp 202-208.
[36] Milanova M G, Campilho A C, Correia M V (2000) Cellular neural networks for mo-
tion estimation. In: International Conference on Pattern Recogni-
tion, Barcelona, Spain, Sept 3-7, 2000. pp 827-830.
[37] Toffels A, Roska A, Chua L O (1996) An object-oriented approach to video coding via
the CNN Universal Machine. In: Fourth IEEE International Workshop on Cellular
Neural Networks and their Applications, 1996, CNNA-96, 24-26 June 1996, pp 13-18.
[38] Grassi G, Greco L A (2002) Object-oriented image analysis via analogical CNN algo-
rithms - part I: Motion estimation. In: 7th IEEE International Workshop Frankfurt, Ger-
many 22 - 24 July 2002.
[39] Grassi G, Grieco L A (2003) Object-oriented image analysis using the CNN universal
machine: new analogic CNN algorithms for motion compensation, image synthesis,
and consistency observation. In: IEEE Transactions on Circuits and Systems I, vol
50, no 4 , April 2003, pp 488 – 499.
[40] Luthon F, Dragomirescu D (1999) A cellular analog network for MRF-based video
motion detection. In: IEEE Transactions on Circuits and Systems, vol 46, no 2, Feb
1999 pp 281-293.
[60] ISO/IEC 13818-2, Information Technology (2000) Generic coding of Moving Pictures
and Associated Audio Information. Part 2.
[61] Haskell G B, Puri A, Netravali A N (1997) Digital video: an introduction to MPEG-2.
Digital Multimedia, standard Series. In: Chapman & Hall 1997.
[62] ISO/IEC 14496-2:2001 Information Technology. Coding of audio-visual objects. Part
2.
[63] Grill B (1999) The MPEG-4 General Audio Coder. In: Proc. AES 17th International
Conference, Set 1999.
[64] Scheirer E D (1998) The MPEG-4 structured audio Standard. In: IEEE Proc. On
ICASSP, 1998.
[65] Koenen R (2002) Overview of the MPEG-4 Standard-(V.21-Jeju Version). ISO/IEC
JTC1/SC29/WG11 N4668, March 2002.
[66] Aizawa K, Huang T S, Model-Based Image Coding: Advanced Video Coding tech-
niques for low bit-rate applications. In: Proc. IEEE, vol 83, no 2, Feb. 95.
[67] Avaro O, Salembier P (2001) MPEG-7 systems: Overview. In: IEEE Transactions on Circuits and Systems for Video Technology, vol 11, no 6, June 2001.
[68] ISO/IEC JTC1/SC29/WG11 N3933, Jan 2001. MPEG-7 Requirements document.
[69] Manjunath B S, Salembier P, Sikora T (2002) Introduction to MPEG-7: Multimedia Content Description Interface. In: John Wiley & Sons, 2002.
[70] Martínez J M, MPEG-7 Overview (version 9), ISO/IEC JTC1/SC29/WG11N5525,
March 2003
[71] Burnett I, Walle R W, Hill K, Bormans J, Pereira F (2003) MPEG-21: Goals and
Achievements. In: IEEE Computer Society, 2003
[72] Bormans J, Hill K (2002) MPEG-21 Overview v5, ISO/IEC JTC1/SC29/WG11
N5231, October 2002
[73] Saupe D, Hamzaoui R, Hartenstein H (1996) Fractal image compression: An introduc-
tory overview. In: Technical report, Institut für Informatik, University of Freiburg,
1996.
[74] Kung S Y, Diamantaras K I, Taur J S (1994) Adaptive Principal component extraction
(APEX ) and application. In: IEEE Trans. Signal. Processing vol 42 (May 1994) pp
1202-1217.
[75] Piazza F, Smerilli S, Uncini A, Griffo M, Zumino R, (1996) Fast Spline Neural Net-
works for Image Compression. In: WIRN-96, Proc. Of the 8th Italian Workshop on
Neural Nets, Vietri sul Mare, Salerno, Italy.
[76] Skrzypkowiak S S, Jain V K (1997) Formative motion estimation using affinity cells
neural network for application to MPEG-2. In: Proc. International Conference on
Communications, pp 1649-1653, June 1997.
[77] ISO/IEC JTC1/SC29/WG11, ITU-T VCEG: Working Draft Number 2 of the Joint Video Team standard.
[78] Topi L, Parisi R, Uncini A (2002) Spline Recurrent Neural Networks for Quad-Tree
Video Coding. In: WIRN-2002, Proc. Of the 13th Italian Workshop on Neural Nets,
Vietri sul Mare, Salerno, Italy, 29-31 May 2002.
ITU-T SG XVI (1997) Video coding for low bitrate communication. DRAFT 13, H.263+, Q15-A-60 rev. 0, 1997.
Sicuranza G L, Ramponi G, Marsi S (1990) Artificial neural networks for image compression. Electronics Letters, vol 26, pp 477-479.
Kiely A B, Klimesh M (2001) A new entropy coding technique for data compression. IPN PR 42-146, April-June 2001, pp 1-48, August 15, 2001.
[] Weinberger M J, Rissanen J J, Arps R B (1996) Application of universal context modelling to lossless compression of grey-scale images. IEEE Trans. on Image Processing, vol 5, no 4, 1996, pp 575-586.