Abstract
The goal of video watermarking is to embed a message within a video file in a
way such that it minimally impacts the viewing experience but can be recovered
even if the video is redistributed and modified, allowing media producers to assert
ownership over their content. This paper presents RivaGAN, a novel architecture
for robust video watermarking which features a custom attention-based mechanism
for embedding arbitrary data as well as two independent adversarial networks which
critique the video quality and optimize for robustness. Using this technique, we are
able to achieve state-of-the-art results in deep learning-based video watermarking
and produce watermarked videos which have minimal visual distortion and are
robust against common video processing operations.
1 Introduction
Video watermarking is a set of techniques that aims to hide information in a video stream in such a
way that it is hard to remove or tamper with, all while preserving the quality and fidelity of the content.
Video watermarking allows content creators to prove ownership after distribution and enables movie
producers to identify leaks by embedding unique identifiers into preview copies of films [2]. Other
examples of applications include everything from identifying copyright infringement and embedding
tags for filtering content to automated broadcast monitoring for commercials.
Effective video watermarking is both invisible and robust. However, existing techniques rarely
achieve both of these goals at once. Invisible watermarking is much more challenging in videos than
in still images because perturbing frames independently may result in highly visible distortions such
as flickering. Also, classical watermarking techniques based on algorithms such as the discrete cosine
transform or discrete wavelet transform are typically not robust to video processing operations like
cropping and scaling. If the leaked video has undergone any of these geometric transformations, the
watermark may be destroyed.
The goal of this paper is to design a deep learning-based, multi-bit video watermarking process that
is both robust and invisible. We are motivated by the recent success of deep learning and adversarial
training methods in data hiding tasks, as shown by [32, 28]. In this paper, we propose a novel
architecture that goes beyond the standard convolutional layers and operations used in related deep
learning-based systems for image steganography and watermarking.
Our paper is organized as follows: Section 2 discusses related work in watermarking and steganogra-
phy, Section 3 introduces our approach to video watermarking, Section 4 presents some results on
benchmark datasets, and Section 5 provides additional insights into how our model functions.
3 RivaGAN
In this section, we introduce our model for robust video watermarking. Our goal is to encode a D-bit
data vector, where D ∈ {32, 64}, into an arbitrary video of T frames in such a way that (1) the bit
vector can be reliably recovered given one or more frames of the watermarked video, (2) there are no
visible distortions, (3) the watermark cannot be easily removed by watermark removal tools, and (4)
the watermark is robust against common video processing operations.
To achieve these goals, we design our architecture with two adversaries: a critic which evaluates the
quality of the watermarked video, and an adversary network which attempts to remove the watermark.
These two networks are trained alongside an encoder network, which adds the watermark to a video, and a decoder
network, which extracts the watermark. We present our architecture in Figure 2 and describe the
individual modules in Section 3.1.
In addition, we introduce a new mechanism for combining the data and image representations
which is more robust against common video processing operations and is easier to train. Currently,
existing approaches to deep learning-based data hiding operate by concatenating
the binary data to a feature map derived from the image and applying additional convolution layers to
generate the output. We propose a different attention-based mechanism (shown in Figure 1) which
learns a probability distribution over the data dimensions for each pixel and uses that distribution
to select the bits to pay attention to during the embedding process. This biases our model towards
learning to hide different bits in different objects and textures, making it easier to train and resulting
in robustness against operations such as cropping, scaling, and compression.
[Figure 1 graphic: the left panel, “Spatial Repetition”, shows a data bit naively repeated across the spatial dimensions; the right panel, “Attention”, shows (1) an attention mask for bit 1 and (2) an attention distribution for pixel 1.]
Figure 1: This figure shows the difference between what related deep learning-based approaches (left)
to this task use to represent their data and what our attention-based approach (right) uses. Unlike
existing approaches, which naively repeat the data across the spatial dimensions, we learn a probability
distribution over the data for each pixel (i.e. the attention distribution) and use that to generate a
more compact data representation. This operation also has the advantage of being interpretable as an
“attention mask”: we can see which bits each pixel is paying attention to and encourage the model to
pay attention to different bits based on the content of the image.
Notation. Let X ∈ R^{T×W×H×C} be a tensor and Y ∈ R^{C′} be a vector. Then let Cat : (X, Y) → Φ ∈ R^{T×W×H×(C+C′)} be the concatenation of X and Y, where Y is expanded to a T × W × H × C′ dimensional tensor.

Let ConvD→D′ : X → Φ be a 3D convolutional block that takes an input tensor X ∈ R^{T×W×H×D} and maps it to a feature tensor Φ ∈ R^{T×W×H×D′}, where T, W, and H are the time, width, and height dimensions respectively, and D and D′ are the feature depths. The convolutional block applies a 1 × K × K convolution kernel with K = 11, followed by a TanH activation and a batch normalization operation [12].

Let Pool : X → Φ be an adaptive mean pooling operation which takes an input tensor X ∈ R^{T×W×H×D} and maps it to a feature tensor Φ ∈ R^D by averaging over the T, W, and H dimensions.

Let LinearD→D′ : X ∈ R^{...×D} → Φ ∈ R^{...×D′} be a linear transformation of the last dimension of a tensor, so Φ and X have the same shape except for the last dimension, which changes from D to D′.
Finally, let V and V̂ be the original video and the watermarked video, both of which have the same length
and resolution T × W × H and use the RGB color space. Let M ∈ {0, 1}^{32} be the 32-bit watermark,
and let M̂ be the watermark recovered from V̂. Let T be the attention module, E the encoder,
D the decoder, C the critic, and A the adversary.
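To make the notation concrete, the following is a minimal PyTorch sketch of these building blocks. It assumes PyTorch's channel-first (N, C, T, H, W) layout rather than the (T, W, H, C) layout used above, and the names ConvBlock, spatial_temporal_pool, and cat_data are our own; this is an illustration of the definitions, not the released implementation.

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv D->D': a 1 x K x K 3D convolution (K = 11) followed by TanH and batch normalization."""
    def __init__(self, d_in, d_out, k=11):
        super().__init__()
        self.conv = nn.Conv3d(d_in, d_out, kernel_size=(1, k, k), padding=(0, k // 2, k // 2))
        self.norm = nn.BatchNorm3d(d_out)

    def forward(self, x):  # x: (N, D, T, H, W)
        return self.norm(torch.tanh(self.conv(x)))

def spatial_temporal_pool(x):
    """Pool: average over the T, H, and W dimensions, returning an (N, D) tensor."""
    return x.mean(dim=(2, 3, 4))

def cat_data(x, y):
    """Cat: expand the vector y across the T, H, and W dimensions and concatenate it with x."""
    n, _, t, h, w = x.shape
    y = y.view(n, y.shape[1], 1, 1, 1).expand(-1, -1, t, h, w)
    return torch.cat([x, y], dim=1)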
3.1 Architecture
Attention. The attention module is a pair of convolutional layers shared between the encoder and
decoder. It takes the source frames, applies two convolutional blocks, and generates an attention
mask of size (T, W, H, D), where D is the data dimension and (T, W, H) are the time and spatial
dimensions. The attention mask allows the model to use the content of the
image at a particular location to determine which dimensions of the data vector to pay attention to.
As shown in Figure 2, the output is an attention mask where the vector at each pixel can be interpreted
as a multinomial distribution over the D data dimensions. The attention module can be formally
expressed as follows:
a = Conv3→32 (V )
b = Conv32→D (a) (1)
T (V ) = Softmax(b)
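A minimal PyTorch sketch of Equation (1), reusing the ConvBlock helper sketched in the notation section; the class name and tensor layout are our own assumptions, not the released code.

import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """Equation (1): two conv blocks followed by a softmax over the data dimension."""
    def __init__(self, data_dim):
        super().__init__()
        self.conv1 = ConvBlock(3, 32)         # Conv 3->32
        self.conv2 = ConvBlock(32, data_dim)  # Conv 32->D

    def forward(self, frames):                # frames: (N, 3, T, H, W)
        a = self.conv1(frames)
        b = self.conv2(a)
        return torch.softmax(b, dim=1)        # per-pixel distribution over the D data bits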
The attention mechanism biases the model towards hiding data in textures and objects, which are less
affected by common video processing operations than lower-level features. In this way, it helps
encourage robustness against scaling and compression.
[Figure 2 graphic: tensor-level diagrams of the Attention, Encoder, and Decoder modules. The Attention module maps the (T, W, H, 3) video through two Conv blocks and a softmax over the data dimension to a (T, W, H, D) attention mask; the Encoder multiplies the attention mask by the data, concatenates the resulting (T, W, H, 1) tensor with the video, and applies two Conv blocks to produce the watermarked video; the Decoder applies two Conv blocks to the watermarked video, multiplies the (T, W, H, D) output by the attention mask, and pools it to a D-dimensional prediction.]
Figure 2: This figure shows how the attention, encoder, and decoder modules operate on a tensor
level. The attention module uses two convolutional blocks to create an attention mask, which is
then used by the encoder and decoder modules to determine which bits to pay attention to at each
pixel. The encoder module uses the attention mask to compute a compacted form of the data tensor
and concatenates it to the image before applying additional convolutional blocks to generate the
watermarked video. The decoder module extracts the data from each pixel and then weights the
prediction using the attention mask before averaging to try to recover the original data.
Empirically, we find that this attention-based approach achieves faster convergence and better performance
than other approaches such as concatenation or multiplication as in [32]. We compare our attention-based
approach to these competing approaches in Section 4.
Encoder. The encoder network is responsible for taking a fixed-length data vector and embedding
it into a sequence of video frames. The encoder uses the attention module to generate a compact
data tensor of shape (T, W, H, 1), where the D data dimensions have been reduced to a single real
value using the attention weights, and concatenates this compact data tensor to the image. It then
applies two convolutional blocks and generates a residual mask. We constrain the residual such that
an individual pixel can be perturbed by no more than ±0.01 and add the residual mask to the original
video to generate the watermarked output (we represent pixel intensities as floating point numbers in
the range [−1.0, 1.0] rather than as integers in the set {0, ..., 255}). It can be formally expressed as follows:

a = T(V) × M
b = Conv4→32(Cat(V, a))
c = Conv32→3(b)                                   (2)
V̂ = E(V, M) = V + 0.01 · TanH(c)
Consider an extreme example: for a given pixel, the attention module generates an attention vector
where all the values are 0 except in the first dimension, which is 1. In this case, the compacted data
vector would simply contain the first bit of the data. This operation allows the encoder to learn to pay
attention to different dimensions of the data conditioned on the content of the image at each pixel.
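Continuing the sketch (same assumptions as above, reusing ConvBlock and a shared AttentionModule instance), the encoder of Equation (2) might look as follows:

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Equation (2): compact the data with the attention mask, concatenate it with the frames,
    and add a residual bounded to +/-0.01 per pixel."""
    def __init__(self, attention):
        super().__init__()
        self.attention = attention     # shared with the decoder
        self.conv1 = ConvBlock(4, 32)  # Conv 4->32
        self.conv2 = ConvBlock(32, 3)  # Conv 32->3

    def forward(self, frames, data):   # frames: (N, 3, T, H, W), data: (N, D) in {0, 1}
        mask = self.attention(frames)                                      # (N, D, T, H, W)
        a = (mask * data.view(*data.shape, 1, 1, 1)).sum(1, keepdim=True)  # compacted data, (N, 1, T, H, W)
        b = self.conv1(torch.cat([frames, a], dim=1))
        c = self.conv2(b)
        return frames + 0.01 * torch.tanh(c)                               # watermarked frames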
Decoder. The decoder network is responsible for taking a sequence of video frames and extracting
the watermark. As shown in Figure 2, we (1) attempt to extract all D bits of data from every location
in the video and (2) reuse the attention module from the encoder, performing what we refer to as an
“attention pooling" operation to aggregate over the spatial dimensions and generate a D-dimensional
prediction of the watermark bits. This operation can be formally expressed as:

a = Conv3→32(V̂)
b = Conv32→D(a)
c = T(V̂) × b                                      (3)
M̂ = D(V̂) = Pool(c)
This operation is designed to take advantage of the fact that if the encoder paid a lot of attention to
bit d at a particular pixel, then the value of that pixel is more likely to contain information about bit
d than about some arbitrary bit. Therefore, we weight the predictions generated by the decoder by the
amount of attention paid to each bit at each location, and take the average.
We note that the decoder module does not require access to the original source video since the
attention module is applied to the watermarked version of the video; as a result, this decoder satisfies
the criteria for our system to be classified as a blind video watermarking algorithm.
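A corresponding sketch of the decoder in Equation (3); as noted above, the attention mask is computed from the watermarked frames themselves, so no reference to the source video is needed:

import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Equation (3): per-pixel bit predictions weighted by the attention mask
    ("attention pooling"), then averaged over space and time."""
    def __init__(self, attention, data_dim):
        super().__init__()
        self.attention = attention            # the same module used by the encoder
        self.conv1 = ConvBlock(3, 32)         # Conv 3->32
        self.conv2 = ConvBlock(32, data_dim)  # Conv 32->D

    def forward(self, watermarked):           # watermarked: (N, 3, T, H, W)
        a = self.conv1(watermarked)
        b = self.conv2(a)
        c = self.attention(watermarked) * b   # weight each pixel's prediction by its attention
        return spatial_temporal_pool(c)       # (N, D) soft estimate of the watermark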
Critic. The critic network is responsible for taking a sequence of video frames and detecting the
presence of a watermark. It encourages the encoder to watermark the video in such a way that the
distortion is less visible and can fool the critic. This module consists of two convolutional blocks,
followed by an adaptive spatial pooling layer and a linear classification layer which produces the
critic score.
Adversary. The adversary network attempts to imitate an attacker trying to remove the watermark.
Specifically, the adversary network is responsible for taking a sequence of video frames and removing
the watermark to generate a clean sequence of video frames. This module closely resembles the
encoder without the data tensor: it consists of two convolutional blocks followed by a linear layer
which generates a residual mask. We then apply a scaled TanH activation function
to constrain the maximum amount by which an individual pixel can be perturbed to ±0.01, and add
the residual mask to the watermarked video to generate the output.
3.2 Noise Layers

In order to encourage robustness against common video transforms, we apply several noise layers to
the watermarked video before it is passed to the decoder, forcing the encoder and decoder to learn
representations that are invariant to these transforms.
Scaling. The scaling layer is designed to re-scale the video to a random size where the width and
height are between 80-100% of the original. By inserting this noise layer between the encoder and
decoder, we ensure that our model learns to embed data bits in a scale-invariant manner.
Cropping. The cropping layer is designed to randomly select a sub-window that contains 80-100%
of the video frame. By inserting this noise layer between the encoder and decoder, we ensure that our
model learns to embed the data bits with sufficient spatial redundancy that cropping will not remove
the message.
Compression. The compression layer uses the discrete cosine transform (DCT) to provide a dif-
ferentiable approximation of video compression algorithms such as H.264 [25]. By converting the
video into the YCrCb color space, applying the 3D DCT transform, zeroing out 0-10% of the highest
frequency components, applying the inverse DCT transform, and then converting the video back
into the RGB color space, we can force our model to embed watermarks in a compression-resistant
manner.
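As an illustration, the scaling and cropping layers can be sketched as follows; the DCT-based compression layer is omitted, and the function names and the exact way the scale/crop fraction is sampled are our own choices rather than details from the paper.

import random
import torch.nn.functional as F

def random_scale(frames, min_frac=0.8):
    """Scaling layer: resize H and W to a random 80-100% of the original size."""
    n, c, t, h, w = frames.shape
    s = random.uniform(min_frac, 1.0)
    size = (t, max(1, int(h * s)), max(1, int(w * s)))
    return F.interpolate(frames, size=size, mode="trilinear", align_corners=False)

def random_crop(frames, min_frac=0.8):
    """Cropping layer: keep a random sub-window covering 80-100% of each frame."""
    n, c, t, h, w = frames.shape
    s = random.uniform(min_frac, 1.0)
    ch, cw = max(1, int(h * s)), max(1, int(w * s))
    y, x = random.randint(0, h - ch), random.randint(0, w - cw)
    return frames[..., y:y + ch, x:x + cw]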
3.3 Optimization
Loss Functions. In order to train the encoder E and decoder D in our video watermarking model, we
minimize the following loss functions. The cross-entropy loss between the bit vector and the decoded
data
Ld = EV,M [CrossEntropy(M, D(E(V, M )))]
The cross-entropy loss between the bit vector and the decoded data after the watermarked video
is processed by the non-differentiable MJPEG compression operation, which takes the sequence of
frames generated by the model, saves it to disk using the MJPEG compression format, and reads it
back for the decoder to process:

L∗d = EV,M [CrossEntropy(M, D(MJPEG(E(V, M))))]
Table 1: This table shows the results for a model trained to embed 32 bits of data with or without our
attention mechanism. We find that models trained with our attention masking and pooling operations
outperform models trained without it and are significantly more robust against geometric transforms.
The average PSNR of the attention-based models is 42.65 while the average PSNR of the models
without attention is 42.73.
Model MJPEG Cropped Scaled
No Attention 0.595 0.588 0.589
No Attention + Noise 0.973 0.970 0.915
Attention 0.997 0.981 0.985
Attention + Noise 0.997 0.995 0.987
The realism of the watermarked video according to the critic network
Lc = EV,M [C(E(V, M ))]
The cross-entropy loss between the bit vector and the data that is recovered from the watermarked
video after the adversary has tampered with it
La = EV,M [CrossEntropy(M, D(A(E(V, M ))))]
To optimize the critic C and adversary A modules, we also use the following loss functions. The
Wasserstein loss to distinguish between source and watermarked videos
Lw = EV [C(V )] − EV,M [C(E(V, M ))]
The negative cross-entropy loss to teach the adversary to remove the watermark
Lr = −EV,M [CrossEntropy(M, D(A(E(V, M ))))]
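A sketch of how these losses might be computed, using binary cross-entropy with logits as a stand-in for the CrossEntropy term and a caller-supplied mjpeg_roundtrip function for the non-differentiable save/reload step; all names here are assumptions, not the released API.

import torch.nn.functional as F

def watermark_losses(encoder, decoder, critic, adversary, frames, bits, mjpeg_roundtrip):
    wm = encoder(frames, bits)                                               # E(V, M)
    l_d = F.binary_cross_entropy_with_logits(decoder(wm), bits)              # L_d
    l_d_star = F.binary_cross_entropy_with_logits(
        decoder(mjpeg_roundtrip(wm)), bits)                                  # L*_d (no gradient through MJPEG)
    l_c = critic(wm).mean()                                                  # L_c
    l_a = F.binary_cross_entropy_with_logits(decoder(adversary(wm)), bits)   # L_a
    # critic / adversary objectives; wm is detached so these do not backpropagate into the encoder
    l_w = critic(frames).mean() - critic(wm.detach()).mean()                 # L_w
    l_r = -F.binary_cross_entropy_with_logits(
        decoder(adversary(wm.detach())), bits)                               # L_r
    return {"l_d": l_d, "l_d_star": l_d_star, "l_c": l_c, "l_a": l_a, "l_w": l_w, "l_r": l_r}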
Training Procedure. We optimize these loss functions using the Adam optimizer with an initial
learning rate of 10−3 which is decayed when the loss function plateaus; furthermore, we clip the
critic weights to [−0.1, 0.1] and train our model for 300 epochs. During the training stage, we use
standard data augmentation procedures including random horizontal flipping (where we flip all frames
in a given video) and random cropping (where we select a random sub-image from all frames). We
operate on batches of size N = 12 and our procedure for generating the batches involves selecting
N/2 videos from the training dataset and pairing each video with (1) a randomly generated bit vector
M and (2) its complement M̄. We refer to these paired samples as Hamming vector pairs to denote
that the two bit vectors differ in every bit, i.e. by the maximum possible Hamming distance. We find
that this procedure results in faster convergence and improves model performance significantly.
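A minimal sketch of this batch construction, assuming videos holds N/2 clips in a single tensor; the function name is our own.

import torch

def hamming_pair_batch(videos, data_dim=32):
    """Pair each of the N/2 sampled videos with a random bit vector and its bitwise complement."""
    half = torch.randint(0, 2, (videos.shape[0], data_dim)).float()  # one random vector per clip
    bits = torch.cat([half, 1.0 - half], dim=0)                      # append the complements
    frames = torch.cat([videos, videos], dim=0)                      # each clip appears twice
    return frames, bits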
Table 2: This table shows the video quality and watermarking accuracy when embedding D bits of
random data into videos from the test set. The MJPEG column indicates the accuracy obtained by
the decoder after the video is compressed, saved, and read back. The Cropped column indicates the
accuracy obtained after the video is randomly cropped down to 80% of its original size, compressed,
saved, and read back. Similarly, the Scaled column indicates the accuracy obtained after the video is
randomly scaled down to 80% of its original size, compressed, saved, and read back.
Quality Accuracy
Model D PSNR SSIM MJPEG Cropped Scaled
Attention 32 42.71 0.954 0.997 0.981 0.985
Attention + Noise 32 42.61 0.960 0.997 0.995 0.987
Attention + Noise + Critic 32 42.08 0.948 0.998 0.998 0.991
Attention + Noise + Critic + Adversary 32 42.05 0.960 0.992 0.988 0.981
Attention 64 42.20 0.944 0.993 0.980 0.961
Attention + Noise 64 42.22 0.953 0.971 0.966 0.917
Attention + Noise + Critic 64 42.06 0.945 0.991 0.989 0.961
Attention + Noise + Critic + Adversary 64 41.99 0.950 0.983 0.972 0.958
Furthermore, even when noise layers are used, we find that our attention-based models still outperform
concatenation-based approaches.
How effective is our approach? We show some examples of video frames in Figure 3 and note
that the watermarked video does not contain any noticeable artifacts. Our results are presented in
Table 2, which shows the video quality and our ability to recover the watermark for different model
configurations and video processing operations.
We find that when the watermarked video is transmitted without modification, the receiver is able
to decode the 32-bit watermark with above 95% accuracy in all cases. We note that this low error
rate can easily be compensated for through error correcting codes, allowing our system to be used in
real-world applications. Furthermore, we find that the cropping and scaling noise layers are effective
at encouraging robustness against the corresponding video processing operations. When these layers
are applied, the receiver is able to decode the watermark with approximately 99% accuracy despite
cropping and scaling.
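For illustration only, a toy repetition code shows how error correction can absorb a small residual bit-error rate; a practical deployment would likely use a more efficient code (e.g. BCH) and budget for the payload overhead. With an independent 1% bit-error rate, a 3x repetition code brings the per-bit error probability down to roughly 3 × 10^-4, at the cost of tripling the number of embedded bits.

def repetition_encode(bits, k=3):
    """Repeat each payload bit k times before it is embedded."""
    return [b for b in bits for _ in range(k)]

def repetition_decode(noisy_bits, k=3):
    """Majority-vote over each group of k recovered bits."""
    return [int(sum(noisy_bits[i:i + k]) > k / 2) for i in range(0, len(noisy_bits), k)]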
Can humans identify the watermarked video? To further establish the invisibility of our water-
marking scheme, we asked workers on the Mechanical Turk platform to watch a random selection of
videos and try to distinguish the source videos from the watermarked videos. For this experiment, we
generated pairs of source and watermarked videos for all 884 videos in our test set and asked workers
who possessed the “masters" qualification to review each pair and identify which video contained the
watermark.
We present the results of this experiment in Table 3 and note that the human workers are only slightly
better than random guessing. Furthermore, we find evidence to suggest that the critic module reduces
the visibility of the watermark, as the detection rate for watermarked videos generated by the critic
model is 5% lower than for those generated by the baseline models.

Table 3: This table shows the detection rate by workers on Mechanical Turk for a randomly selected
subset of test videos generated by each model.

Model Detection Rate
Attention + Noise 0.541
Attention + Noise + Critic 0.514
Attention + Noise + Critic + Adversary 0.515
5 Additional Insights
What does the watermark look like? Next, we examine where the watermark data is being hidden by
visually inspecting the residual that is generated by the encoder and added to the source video.
Figure 3: This figure shows the watermarked video (top) and the residual masks (bottom). The
residual masks were generated by the encoder module and added to the source video to produce the
watermarked video.
Figure 4: This figure shows the original source video and two example “difference masks” for the
first and second bit of the data tensor. Bright regions indicate that flipping a single bit caused that
pixel to change in the watermarked output. The three images on the top correspond to a model trained
with the attention mechanism and we note that the two difference masks look significantly different.
The three images on the bottom correspond to a model trained without the attention mechanism and
the two difference masks are virtually identical.
Figure 3 shows an example of a source video and the corresponding residuals. We note that the
residual values appear to be fairly evenly distributed across the frame.
How does changing a single bit change the watermark? Finally, we examine the impact of flipping
a single bit in the data tensor by examining the resulting “difference mask”. We compute each difference
mask by taking a fixed data tensor D1, embedding it in the video to generate a watermarked video W1,
changing a single bit in D1 to create D2, and embedding that in the video to generate a watermarked
video W2. Then, we visualize the difference between the two watermarked videos, |W1 − W2|, to
highlight the regions of the watermarked video that are affected by that particular bit.
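A sketch of this procedure, assuming an encoder with the interface from the earlier sketches:

import torch

def difference_mask(encoder, frames, bits, bit_index):
    """Embed a data vector and a copy with one bit flipped, then show where the outputs differ."""
    flipped = bits.clone()
    flipped[:, bit_index] = 1.0 - flipped[:, bit_index]
    with torch.no_grad():
        w1 = encoder(frames, bits)
        w2 = encoder(frames, flipped)
    return (w1 - w2).abs()  # bright regions are the pixels affected by that bit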
We perform this process with a randomly selected image for the first and second bits in our data
tensor and present the results in Figure 4. This figure provides evidence to support our hypothesis
that the attention mechanism allows our model to pay attention to different dimensions of the data
tensor depending on the content of the image as different bits appear to affect different parts of the
watermark. We observe that the difference masks for the two bits are significantly different in the
attention-based model but are not significantly different in the model without attention, suggesting
that this phenomenon can be attributed to the attention mechanism.
Do we need to use Hamming vector pairs? In our initial explorations, we trained our model by
iterating over all of the videos in our dataset and training our model to encode and decode a randomly
generated bit vector into each video. Despite experimenting with multiple optimizers, batch sizes,
and learning rates, we found that our model often failed to converge within a reasonable number of
epochs. This is shown in Figure 5, where the model trained with a high learning rate and without
Hamming vector pairs fails to converge.
[Figure 5 plots: training loss (left) and test accuracy (right) versus wall clock time for Hamming Vector (LR = 0.001), Hamming Vector (LR = 0.0001), No Hamming Vector (LR = 0.001), and No Hamming Vector (LR = 0.0001).]
Figure 5: This figure shows the training loss and test accuracy for the same model architecture, learning
rate, and optimizer, trained with and without Hamming vector pairs (the bit-inverse trick). We find that
including the bit inverse within the same batch results in dramatically faster convergence as well as
better model performance.
In order to overcome this instability, we introduced the concept of Hamming vector pairs and found
that the model converges significantly faster and, in the case of a high initial learning rate, achieves
higher test accuracy. We hypothesize that this is due to the fact that the gradients produced by
Hamming vector pairs are less noisy than the gradients produced by a simple random sample.
6 Conclusion
In this paper, we introduced a new class of attention-based architectures for data hiding tasks such
as steganography and watermarking which is superior to existing approaches such as [32] as it (1)
uses less memory, (2) is easier to train, and (3) is robust against common video processing operations
such as scaling, cropping, and compression. We demonstrated the effectiveness of our approach on
the video watermarking task, achieving near perfect accuracy with minimal visual distortion when
hiding an arbitrary 32-bit watermark into video files. Our code is publicly available and can be
found online at https://github.com/DAI-Lab/RivaGAN.
References
[1] H. O. Altun, A. Orsdemir, G. Sharma, and M. F. Bocko. Optimal spread spectrum watermark
embedding via a multistep feasibility formulation. IEEE Trans. on Image Processing, 18(2):371–
387, Feb 2009.
[2] M. Asikuzzaman and M. R. Pickering. An overview of digital video watermarking. IEEE
Transactions on Circuits and Systems for Video Technology, 28(9):2131–2153, Sep. 2018.
[3] Zhila Bahrami and Fardin Akhlaghian Tab. A new robust video watermarking algorithm based
on surf features and block classification. Multimedia Tools and Applications, 77(1):327–345,
2018.
[4] Shumeet Baluja. Hiding Images in Plain Sight: Deep Steganography. In Proc. of the Conf. on
Neural Information Processing Systems (NIPS), 2017.
[5] S. Biswas, S. R. Das, and E. M. Petriu. An adaptive compressed mpeg-2 video watermarking
scheme. IEEE Trans. on Instrumentation and Measurement, 54(5):1853–1861, Oct 2005.
[6] A. Ferdowsi and W. Saad. Deep learning-based dynamic watermarking for secure signal
authentication in the internet of things. In Proc. of the IEEE Int. Conf. on Communications
(ICC), pages 1–6, May 2018.
[7] Garima Gupta, V. K. Gupta, and Mahesh Chandra. Review on video watermarking techniques
in spatial and transform domain. In Suresh Chandra Satapathy, Jyotsna Kumar Mandal, Siba K.
Udgata, and Vikrant Bhateja, editors, Information Systems Design and Intelligent Applications,
pages 683–691, 2016.
[8] Frank Hartung and Bernd Girod. Watermarking of uncompressed and compressed video. Signal
Processing, 66(3):283–301, 1998.
[9] Jamie Hayes and George Danezis. Generating steganographic images via adversarial training.
In NIPS, 2017.
[10] Dajun He, Qibin Sun, and Qi Tian. A semi-fragile object based video authentication system. In
Proc. of the 2003 Int. Symposium on Circuits and Systems, volume 3, 2003.
[11] J. R. Hernandez, M. Amado, and F. Perez-Gonzalez. Dct-domain watermarking techniques for
still images: detector performance analysis and a new structure. IEEE Transactions on Image
Processing, 9(1):55–68, Jan 2000.
[12] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training
by Reducing Internal Covariate Shift. arXiv e-prints, page arXiv:1502.03167, Feb 2015.
[13] Nie Jie and Wei Zhiqiang. A new public watermarking algorithm for rgb color image based
on quantization index modulation. In 2009 Int. Conf. on Information and Automation, pages
837–841, June 2009.
[14] S. Kadu, C. Naveen, V. R. Satpute, and A. G. Keskar. Discrete wavelet transform based
video watermarking technique. In Proc. of the Int. Conf. on Microelectronics, Computing and
Communications (MicroCom), pages 1–6, Jan 2016.
[15] N. K. Kalantari and S. M. Ahadi. A logarithmic quantization index modulation for perceptually
better data hiding. IEEE Transactions on Image Processing, 19(6):1504–1517, June 2010.
[16] Haribabu Kandi, Deepak Mishra, and Subrahmanyam R.K. Sai Gorthi. Exploring the learning
capabilities of convolutional neural networks for robust image watermarking. Computers &
Security, 65:247 – 268, 2017.
[17] Ashish M. Kothari and Ved Vyas Dwivedi. Transform domain video watermarking: Design,
implementation and performance analysis. In Proc. of the Int. Conf. on Communication Systems
and Network Technologies, pages 133–137, 2012.
[18] Jung-Soo Lee and Whoi-Yul Kim. A new object-based image watermarking robust to geometri-
cal attacks. In Pacific-Rim Conference on Multimedia, pages 58–64. Springer, 2004.
[19] S. P. Maity and S. Maity. Multistage spread spectrum watermark detection technique using
fuzzy logic. IEEE Signal Processing Letters, 16(4):245–248, April 2009.
[20] Marcin Marszałek, Ivan Laptev, and Cordelia Schmid. Actions in context. In IEEE Conference
on Computer Vision & Pattern Recognition, 2009.
[21] Bijan G. Mobasseri and Domenick Cinalli. Reversible watermarking using two-way decodable
codes. In Proc. of the Int. Society for Optical Engineering, Security, Steganography, and
Watermarking of Multimedia (VI), pages 397–404, 2004.
[22] N. Mohaghegh and O. Fatemi. H.264 copyright protection with motion vector watermarking.
In Int. Conf. on Audio, Language and Image Processing, pages 1384–1389, July 2008.
[23] M. Noorkami and R. M. Mersereau. A framework for robust watermarking of h.264-encoded
video with controllable detection performance. IEEE Trans. on Information Forensics and
Security, 2(1):14–23, March 2007.
[24] S. Pereira, J. J. K. O. Ruanaidh, F. Deguillaume, G. Csurka, and T. Pun. Template based
recovery of fourier-based watermarks using log-polar and log-log maps. In Proc. of the IEEE
Int. Conf. on Multimedia Computing and Systems, volume 1, pages 870–874, June 1999.
[25] Iain E. Richardson. The H.264 Advanced Video Compression Standard. Wiley Publishing, 2nd
edition, 2010.
[26] Mathias Schlauweg, Dima Pröfrock, Benedikt Zeibich, and Erika Müller. Self-synchronizing
robust texel watermarking in gaussian scale-space. In Proceedings of the 10th ACM Workshop
on Multimedia and Security, MM&Sec ’08, pages 53–62, New York, NY, USA, 2008. ACM.
[27] M. D. Swanson, Bin Zhu, B. Chau, and A. H. Tewfik. Multiresolution video watermarking
using perceptual models and scene segmentation. In Proceedings of International Conference
on Image Processing, volume 2, pages 558–561 vol.2, Oct 1997.
[28] Matthew Tancik, Ben Mildenhall, and Ren Ng. Stegastamp: Invisible hyperlinks in physical
photographs. CoRR, abs/1904.05343, 2019.
[29] V. Vukotić, V. Chappelier, and T. Furon. Are Deep Neural Networks good for blind image
watermarking? In Proc. of the IEEE Int. Workshop on Information Forensics and Security
(WIFS), pages 1–7, Dec 2018.
[30] Xinyu Weng, Yongzhi Li, Lu Chi, and Yadong Mu. Convolutional video steganography with
temporal residual modeling. CoRR, abs/1806.02941, 2018.
[31] B. Yann, L. Nathalie, and D. Jean-Luc. A comparative study of different modes of perturbation
for video watermarking based on motion vectors. In Proc. of the 12th Euro. Signal Processing
Conf., pages 1501–1504, 2004.
[32] Jiren Zhu, Russell Kaplan, Justin Johnson, and Li Fei-Fei. HiDDeN: Hiding Data With Deep
Networks. In Proc. 15th Euro. Conf. on Computer Vision (ECCV) Part XV, pages 682–697,
2018.
[33] Wenwu Zhu, Zixiang Xiong, and Ya-Qin Zhang. Multiresolution watermarking for images and
video. IEEE Trans. on Circuits and Systems for Video Technology, 9(4):545–550, June 1999.