DeepLearning MaterialsTextures GDC17 FINAL
Session Description: Recently deep learning has revolutionized computer vision and other
recognition problems. Everyday applications using such techniques are now commonplace
with more advanced tasks being automated at a growing rate. During 2016, “image synthesis”
techniques started to appear that used deep neural networks to apply style transfer
algorithms for image restoration. The speakers review some of these techniques and
demonstrate their application in image magnification to enable “super resolution” tools.
The speakers also discuss recent discoveries by NVIDIA Research that use AI, machine
learning, and deep learning-based approaches to greatly improve the process of creating
game-ready materials. Using these novel techniques, artists can use standard DSLR, or even
cell phone cameras, to create full renderable materials in minutes. The session concludes by
showing how developers can integrate these methods into their existing art pipelines.
Takeaway: Attendees will gain information about the latest application of machine and deep
learning for content creation and get access to new resources to improve their work.
Intended Audience: Texture artists, art directors, tool programmers, anyone interested in
the latest evolution of deep learning in game development.
1
Overview
Welcome
What is Deep Learning?
“GameWorks: Materials & Textures” [producers and artists rejoice]
Examine in detail the design of one tool [coders bathe in technical details]
Wrap up
gameworks.nvidia.com 2
2
Deep Learning – What is it?
AI vs ML vs DL - great explanation https://goo.gl/hkayWG
Why now?
Better algorithms
Large datasets
Machine Learning at its most basic is the practice of using algorithms to parse data, learn
from it, and then make a determination or prediction about something in the world. So rather
than hand-coding software routines with a specific set of instructions to accomplish a
particular task, the machine is “trained” using large amounts of data and algorithms that give
it the ability to learn how to perform the task.
One approach to ML is "artificial neural networks" (ANNs): basically, use "simple" math in a
distributed way to mimic the way we think neurons in the brain work. For years ANNs
produced little of value, until Prof. Hinton at the University of Toronto parallelized the
algorithms, they were moved onto GPUs, and training datasets exploded in size.
3
Deep Learning is Ready For Use
Already many ways to use deep learning today
Just in: Baidu DeepVoice
Chat bots
Check services from Google, AWS, Azure if you don’t “roll your own”
gameworks.nvidia.com 4
4
Deep Learning for Art Right Now
Style transfer
Generative networks creating images and voxels
Adversarial networks (DCGAN) – still early but promising
Artomatix
Allegorithmic
Autodesk
gameworks.nvidia.com 5
5
Style Transfer: Something Fun!
Doodle a masterpiece!
Sept 2015: A Neural Algorithm of Artistic Style
by Gatys et al
Uses CNN to take the “style” from one image and
apply it to another
References:
A Neural Algorithm of Artistic Style paper by Leon A. Gatys, Alexander S. Ecker, and
Matthias Bethge
Services:
http://ostagram.ru/static_pages/lenta
https://www.instapainting.com/ai-painter
iOS app (calls out to server) http://prisma-ai.com/
6
http://ostagram.ru/static_pages/lenta gameworks.nvidia.com 7
But in addition to being a great toy, there is great potential – I mean, the AI is
actually drawing pixels in a meaningful way.
Style Transfer: Something Useful
Game remaster & texture enhancement
Try Neural Style and use a real-world photo for the “style”
gameworks.nvidia.com 8
8
NVIDIA’s Goals for DL in Game Development
Looking at all the research, clearly there’s scope for tools based on DL
Goals:
Expand the use of deep learning into content creation
gameworks.nvidia.com 9
9
“GameWorks: Materials & Textures”
Set of tools targeting the game industry using machine learning and deep learning
https://gwmt.nvidia.com
Super-resolution
Texture Multiplier
gameworks.nvidia.com 10
10
GameWorks: Materials & Textures beta
Tools run as a web service
Sign up for the Beta at: https://gwmt.nvidia.com
Seeking feedback from artists on usage of tools and quality
Also interested in feedback from programmers on automation, pipeline and
engine integration
gameworks.nvidia.com 11
11
Photo To Material: 2Shot
From two photos of a surface, generate a “material”
Based on a SIGGRAPH 2015 paper by NVResearch and Aalto University (Finland)
“Two-Shot SVBRDF Capture for Stationary Materials”
https://mediatech.aalto.fi/publications/graphics/TwoShotSVBRDF/
Or align later
gameworks.nvidia.com 12
12
Material Synthesis from Two Photos
[Output maps: diffuse albedo, specular, normals, glossiness, anisotropy]
gameworks.nvidia.com 13
13
Material Synthesis Process
gameworks.nvidia.com 14
14
Demo
Photo To Material: 2Shot
15
Photo To Material: 1Shot
What’s better than two photos? One!
SIGGRAPH 2016 paper by NVResearch and Aalto University (Finland)
“Reflectance modeling by neural texture synthesis”
http://dl.acm.org/citation.cfm?id=2925917&preflayout=flat
gameworks.nvidia.com 16
16
1shot – EARLY Previews
gameworks.nvidia.com 17
17
Texture Multiplier
Put simply: texture in, new texture out
Inspired by Gatys et al
Texture Synthesis Using Convolutional Neural Networks
https://arxiv.org/pdf/1505.07376.pdf
Artomatix
Similar product “Texture Mutation”
https://artomatix.com/
gameworks.nvidia.com 18
Currently “Beta”
Some artifacts – 256x256 now, with 512 and 1024 coming
18
Super Resolution
Final tool in the first roll-out of GameWorks: Materials & Textures
Introduce Dmitry and Marco
Deep dive on the tool and explanation of some recent DL-based research and techniques
gameworks.nvidia.com 19
19
Zoom! Enhance!
Yes
Sure!
gameworks.nvidia.com 20
20
Super-resolution: the task
[Diagram: a given low-resolution image (W × H) is upscaled to a constructed high-resolution image (nW × nH)]
gameworks.nvidia.com 21
The task is to "generate" a bigger image from a smaller one. If we want to use
machine learning to do this, we can create two sets, one of big images and one of
their downscaled versions, and train our system with these two sets.
21
Super-resolution as reconstruction task
[Diagram: unknown original high-resolution image → downscaling → given image → reconstruction → reconstructed image]
22
Super-resolution: ill-posed task
[Diagram: pixels of the original image → downscaling (information is lost here) → pixels of the given image → reconstruction → pixels of the reconstructed image, with the missing pixels shown as "?"]
gameworks.nvidia.com 23
But the problem is ill-posed. We first remove some information, and then try to
reconstruct the image using less data (1/4 in this case; 1/n² in general for a downscale
factor n).
23
Super-resolution: ill-posed task
[Same diagram as the previous slide]
gameworks.nvidia.com 24
24
Super-resolution: ill-posed task
OR DO YOU?
gameworks.nvidia.com 25
25
Where does the magic come from?
•Let’s consider 8x8 patch of some 8-bit grayscale image
•How many of such patches are there?
gameworks.nvidia.com 26
Let’s consider a small portion of the original image, say 8x8 patch, and let’s consider
a single channel of 8 bit.
26
Where does the magic come from?
•Let’s consider 8x8 patch of some 8-bit grayscale image
•How many of such patches are there?
N = 256(8∗8) ≈ 10153
gameworks.nvidia.com 27
The number of possible values for the pixel is 256, and the number of pixels is
8x8=64, so the total number of possible images is quite big
27
Where does the magic come from?
•Let’s consider 8x8 patch of some 8-bit grayscale image
•How many of such patches are there?
N = 256(8∗8) ≈ 10153
•More than the number of atoms in observable universe
That’s actually more atoms than the observable universe, maybe an image contains
less information than this.
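For reference, the count is easy to check with Python's arbitrary-precision integers:

```python
# Number of distinct 8x8 patches of an 8-bit grayscale image.
n_patches = 256 ** (8 * 8)      # 256 intensity levels per pixel, 64 pixels
print(len(str(n_patches)) - 1)  # order of magnitude of the count: 154
print(n_patches > 10 ** 80)     # True: more than the ~10^80 atoms in the observable universe
```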
28
Where does the magic come from?
Photos
Natural images
Textures
Indeed, among the possible images, photos and textures are a very small subset
29
Super-resolution under constraints
•Data from natural images is sparse or compressible in some domain
•To reconstruct such images some prior information or constraints are required
[Diagram: downscaling → reconstruction + prior information + constraints]
gameworks.nvidia.com 30
If we constrain our problem to deal with natural images and textures, we can enhance
the content without much loss.
30
Hand-crafted constraints and priors
•Interpolation (bicubic, lanczos, etc.)
•Interpolation + Sharpening (and other filtering)
gameworks.nvidia.com 31
One possible option is to construct an upscaling method that makes some a priori decisions
about the resulting image (e.g. sharpness), as in the sketch below.
This will work in some cases, but in general it requires a lot of manual work to hand-craft
the upscaling logic into our algorithm.
We need a better method, something that looks at images from our specific domain
and finds which features are interesting.
Such methods are usually machine learning methods.
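For illustration, here is a minimal sketch of such a hand-crafted baseline using Pillow (bicubic upscaling followed by an unsharp mask); the file names, scale factor, and filter settings are placeholder assumptions, not part of the tools discussed here.

```python
from PIL import Image, ImageFilter

# Hand-crafted baseline: bicubic interpolation followed by a fixed sharpening step.
lr = Image.open("input_lr.png")                    # placeholder path
scale = 4
hr = lr.resize((lr.width * scale, lr.height * scale), Image.BICUBIC)
# The "prior" is hard-coded: assume the upscaled result should simply look sharper.
hr = hr.filter(ImageFilter.UnsharpMask(radius=2, percent=150, threshold=3))
hr.save("output_sr_baseline.png")
```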
31
Super-resolution: machine learning
Idea: use machine learning to capture prior knowledge and statistics from the data
[Diagram: machine learning at the intersection of mathematical optimization, computer science, and statistics]
gameworks.nvidia.com 32
The idea is to exploit prior knowledge about our image domain, and we can gather such
knowledge using machine learning: a technique for building intelligent systems that are
not explicitly programmed, but trained via error minimization to capture and exploit the
internal structure and features of the training data automatically.
32
Patch-based mapping
[Diagram: low-resolution patch → mapping (model params) → high-resolution patch]
gameworks.nvidia.com 33
Let's reduce our task to a simpler one: transformation of an image patch. Let's
consider a mapping function which constructs a high-resolution patch from a given low-
resolution patch of the input image. Such a mapping function will depend on a set of
parameters, which we want to find using machine learning.
33
Patch-based mapping: training
[Diagram: training images → pairs of (LR, HR) patches; low-resolution patch → mapping (model params) → high-resolution patch]
gameworks.nvidia.com 34
34
Patch-based mapping: training
[Diagram: training images → pairs of (LR, HR) patches → training of the model params; low-resolution patch → mapping → high-resolution patch]
gameworks.nvidia.com 35
35
Patch-based mapping: training
[Same diagram as the previous slide]
gameworks.nvidia.com 36
After training, we expect our model to be capable of predicting the high-resolution
patch in the most optimal way.
36
Patch-based mapping
[Diagram: LR patch $x_L$ → encode → decode → HR patch $x_H$]
gameworks.nvidia.com 37
A good way to build the mapping function is to use an encoding of an input patch into
some intermediate scale-invariant representation, which will carry some semantic
information about the patch.
37
Patch-based mapping: sparse coding
[Diagram: LR patch $x_L$ → encode → sparse code → decode → HR patch $x_H$]
gameworks.nvidia.com 38
One way to build such a representation is sparse coding. Here we exploit our prior
knowledge that our signal is sparse in some domain.
38
Sparse coding and dictionary learning
•An image patch can be represented as a sparse linear combination of dictionary elements
•The dictionary is learned from the data (in contrast to a hand-crafted dictionary like DCT)
$x = Dz = d_1 z_1 + \dots + d_K z_K$
where $D$ is the dictionary, $x$ is the patch, and $z$ is the sparse code
[Example: $x = 0.8\,d_{36} + 0.3\,d_{42} + 0.5\,d_{63}$]
gameworks.nvidia.com 39
39
Patch-based mapping via sparse coding
Mapping
𝒙𝑳
LR patch
gameworks.nvidia.com 40
40
Patch-based mapping via sparse coding
[Diagram: LR patch $x_L$ → encode with LR dictionary $D_L$ → sparse code $z$]
$z = \arg\min_z \|D_L z - x_L\|_2^2 + \gamma \|z\|_0$
gameworks.nvidia.com 41
41
Patch-based mapping via sparse coding
[Diagram: LR patch $x_L$ → encode with LR dictionary $D_L$ → sparse code $z$ → decode with HR dictionary $D_H$ → HR patch $x_H$]
$z = \arg\min_z \|D_L z - x_L\|_2^2 + \gamma \|z\|_0 \qquad x_H = D_H z$
gameworks.nvidia.com 42
Then, given the sparse codes and high-resolution dictionary, we perform decoding,
simply calculating the linear combination.
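As a rough illustration (not the exact method used by the tool), here is a NumPy sketch of the encode/decode steps, assuming the coupled dictionaries $D_L$ and $D_H$ have already been learned; the greedy matching-pursuit encoder and the toy patch sizes are assumptions made only for this example.

```python
import numpy as np

def omp_encode(D_L, x_L, n_nonzero=3):
    """Greedy (OMP-style) sparse coding of an LR patch against the LR dictionary."""
    residual, support = x_L.copy(), []
    z = np.zeros(D_L.shape[1])
    for _ in range(n_nonzero):
        # Pick the dictionary atom most correlated with the current residual.
        support.append(int(np.argmax(np.abs(D_L.T @ residual))))
        # Re-fit the coefficients on the selected atoms by least squares.
        coeffs, *_ = np.linalg.lstsq(D_L[:, support], x_L, rcond=None)
        residual = x_L - D_L[:, support] @ coeffs
    z[support] = coeffs
    return z

def decode_hr(D_H, z):
    """Decode: the HR patch is the same sparse combination of HR atoms."""
    return D_H @ z

# Toy shapes: 8x8 LR patches (64-dim), 16x16 HR patches (256-dim), K = 512 atoms.
rng = np.random.default_rng(0)
D_L, D_H = rng.standard_normal((64, 512)), rng.standard_normal((256, 512))
x_H = decode_hr(D_H, omp_encode(D_L, rng.standard_normal(64)))
```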
42
Patch-based mapping via sparse coding
[Image: learned LR dictionary $D_L$ and HR dictionary $D_H$]
gameworks.nvidia.com 43
43
Generalized patch-based mapping
[Diagram: LR patch → mapping → high-level representation of the LR patch ("features") → mapping in the feature space → high-level representation of the HR patch → mapping → HR patch]
gameworks.nvidia.com 44
We may generalize the idea and build another mapping function with a more complex
internal representation. For example, first map the input patch into a corresponding high-
level representation, then perform some transformation in that space, and then map the
resulting high-level representation back to image space, i.e. to a high-resolution patch.
44
Generalized patch-based mapping
[Diagram: LR patch → mapping ($W_1$) → mapping in the feature space ($W_2$) → mapping ($W_3$) → HR patch, with $W_1, W_2, W_3$ the trainable parameters]
gameworks.nvidia.com 45
All transformations depend on some parameters, which we adjust during the training.
This could be a neural net, for example.
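For example, a tiny fully-connected network in PyTorch could play the role of the three trainable transformations $W_1$, $W_2$, $W_3$; the patch and feature sizes below are arbitrary assumptions, just to make the sketch concrete.

```python
import torch
import torch.nn as nn

# LR patch -> feature space (W1) -> transformed features (W2) -> HR patch (W3).
class PatchMapper(nn.Module):
    def __init__(self, lr_size=8, hr_size=16, feat=256):
        super().__init__()
        self.to_features = nn.Linear(lr_size * lr_size, feat)   # W1
        self.in_features = nn.Linear(feat, feat)                 # W2
        self.to_hr_patch = nn.Linear(feat, hr_size * hr_size)    # W3

    def forward(self, x_lr):
        h = torch.relu(self.to_features(x_lr))
        h = torch.relu(self.in_features(h))
        return self.to_hr_patch(h)

model = PatchMapper()
x_lr = torch.rand(32, 8 * 8)   # a batch of flattened 8x8 LR patches
x_hr = model(x_lr)             # predicted 16x16 HR patches (flattened)
```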
45
Mapping of the whole image: using convolution
Convolutional operators
HR image
LR image
gameworks.nvidia.com 46
Now let's recall that we actually want to do a super-resolution for the whole image.
In this case, we can apply our patch-based transformation to the set of all
overlapping patches on the input image, and then assemble resulting high-resolution
patches into high-resolution output. These operations could be implemented via a
convolutional operator. The resulting structure is very similar to one well-known
type of neural network: the auto-encoder.
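A small PyTorch sketch of the patch bookkeeping (extract all overlapping patches, then reassemble and average the overlaps); the patch size and stride are arbitrary choices for this example, and the actual per-patch mapping is left out.

```python
import torch
import torch.nn.functional as F

img = torch.rand(1, 3, 64, 64)                     # toy LR image (N, C, H, W)
# Extract all overlapping 8x8 patches as columns: (1, 3*8*8, num_patches).
patches = F.unfold(img, kernel_size=8, stride=4)
# ... here each column would be mapped to its high-resolution counterpart ...
# Reassemble: overlapping patches are summed by fold, so divide by the overlap count.
summed = F.fold(patches, output_size=(64, 64), kernel_size=8, stride=4)
counts = F.fold(torch.ones_like(patches), output_size=(64, 64), kernel_size=8, stride=4)
recon = summed / counts                            # equals img when the mapping is identity
```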
46
Auto-encoder
gameworks.nvidia.com 47
What’s an Auto-Encoder?
It’s a neural network trained to reconstruct its input.
What’s difficult is doing it by passing to an internal representation, with less
information (hourglass structure)
47
Auto-encoder
Encode Decode
features
gameworks.nvidia.com 48
An autoencoder network is composed of two parts: an ENCODER, which takes the input
and converts it to the internal representation (feature space), and a DECODER, which
tries to regenerate the input.
48
Auto-encoder
[Diagram: input $x$ → auto-encoder $F_W$ with parameters $W$ → output $y$]
Inference: $y = F_W(x)$
Training: $W = \arg\min_W \sum_i \mathrm{Dist}(x_i, F_W(x_i))$, where $\{x_i\}$ is the training set
gameworks.nvidia.com 49
When the encoder and decoder are modeled by a DNN, the parameter space is defined by
a set of weights (W).
During training we try to minimize a specific loss function (a "distance" between the
input and the output). If there's enough information in the middle layer plus the prior
knowledge, the reconstruction will be perfect (the distance will be 0); if there isn't
enough information, the network will still minimize the distance measured on the
training set, as in the sketch below.
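As a minimal sketch of this objective (not the production setup), a toy hourglass auto-encoder and its training loop in PyTorch might look like this; the layer sizes, optimizer, and random stand-in data are assumptions.

```python
import torch
import torch.nn as nn

# Hourglass auto-encoder: the middle layer holds less information than the input.
class AutoEncoder(nn.Module):
    def __init__(self, dim=784, bottleneck=64):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU())
        self.decode = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return self.decode(self.encode(x))

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
dist = nn.MSELoss()                            # the Dist(x, F_W(x)) of the slide

for x in torch.rand(100, 16, 784).unbind(0):   # stand-in for real training batches
    opt.zero_grad()
    loss = dist(model(x), x)                   # W = argmin_W sum_i Dist(x_i, F_W(x_i))
    loss.backward()
    opt.step()
```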
49
Auto-encoder
Our encoder is LOSSY by definition
[Diagram: input → encode → features, with information loss]
gameworks.nvidia.com 50
50
Super-resolution auto-encoder
[Diagram: input $x$ → super-resolution auto-encoder $F_W$ with parameters $W$ → output $y$]
Inference: $y = F_W(x)$
Training: $W = \arg\min_W \sum_i \mathrm{Dist}(x_i, F_W(x_i))$, where $\{x_i\}$ is the training set
gameworks.nvidia.com 51
51
Network topology
Using global information
Fixed-resolution
Better result (?)
gameworks.nvidia.com 52
Using all pixels in the image: does this mean better results? Maybe.
Using only local information we have fewer parameters and a scalable network. Does this
mean lower quality? Not necessarily, since we are exploiting LOCAL information.
52
Super-resolution convolutional auto-encoder
[Diagram: input $x$ → convolutional auto-encoder with parameters $W$ → output $y$]
Only use size-independent layers:
Convolution
Downscaling: pooling, strided convolution
Upscaling: data replication, interpolation, deconvolution
53
Super-resolution convolutional auto-encoder
Why Downscaling?
Collect multi-scale information
Deeper features
gameworks.nvidia.com 54
54
SRCAE: Overview
In Down … Down Up … Up Out
gameworks.nvidia.com 55
55
SRCAE: Input translation
In Down … Down Up … Up Out
“In” block
Convolution (5x5)
Feature expansion (3->32)
ReLU
gameworks.nvidia.com 56
56
SRCAE: Encoder
In Down … Down Up … Up Out
“Down” block
3x3 Convolution
ReLU
3x3 Convolution
ReLU
3x3 Strided (2x) convolution with feature expansion
ReLU
gameworks.nvidia.com 57
57
SRCAE: Decoder
In Down … Down Up … Up Out
“Up” block
3x3 Convolution
ReLU
3x3 Convolution
ReLU
3x3 Strided (2x) deconvolution with feature reduction
ReLU
gameworks.nvidia.com 58
58
SRCAE: Output
In Down … Down Up … Up Out
gameworks.nvidia.com 59
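Putting the In / Down / Up blocks described above together, a PyTorch sketch of such a network might look like the following. The channel counts, the number of blocks, the extra Up block that provides the net 2x upscaling, and the final Out convolution back to RGB are all assumptions; the slides do not specify them.

```python
import torch
import torch.nn as nn

def down_block(ch_in, ch_out):
    # Two 3x3 convolutions, then a 2x-strided 3x3 convolution with feature expansion.
    return nn.Sequential(
        nn.Conv2d(ch_in, ch_in, 3, padding=1), nn.ReLU(),
        nn.Conv2d(ch_in, ch_in, 3, padding=1), nn.ReLU(),
        nn.Conv2d(ch_in, ch_out, 3, stride=2, padding=1), nn.ReLU())

def up_block(ch_in, ch_out):
    # Two 3x3 convolutions, then a 2x-strided 3x3 deconvolution with feature reduction.
    return nn.Sequential(
        nn.Conv2d(ch_in, ch_in, 3, padding=1), nn.ReLU(),
        nn.Conv2d(ch_in, ch_in, 3, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(ch_in, ch_out, 3, stride=2, padding=1, output_padding=1),
        nn.ReLU())

class SRCAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(),    # "In": 5x5 conv, 3 -> 32
            down_block(32, 64), down_block(64, 128),      # "Down" blocks
            up_block(128, 64), up_block(64, 32),          # "Up" blocks
            up_block(32, 32),                             # extra "Up" gives the net 2x upscale
            nn.Conv2d(32, 3, 5, padding=2))               # "Out": back to RGB (assumed)

    def forward(self, x):
        return self.net(x)

y = SRCAE()(torch.rand(1, 3, 64, 64))   # -> (1, 3, 128, 128): one octave of super-resolution
```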
59
SRCAE: Training
[Diagram: HR image $x$ → downscaling $D$ → LR image $\hat{x}$ → SRCAE $F_W$ (parameters $W$) → output $y$]
gameworks.nvidia.com 60
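A hedged sketch of this training loop, reusing the SRCAE class from the previous sketch; the stand-in HR data and bicubic downscaling are assumptions standing in for the real training set and whatever downscaler D is actually used.

```python
import torch
import torch.nn.functional as F

model = SRCAE()                                   # from the sketch above
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
hr_training_images = [torch.rand(4, 3, 128, 128) for _ in range(10)]  # stand-in HR batches

for x in hr_training_images:
    x_lr = F.interpolate(x, scale_factor=0.5,     # downscaling D
                         mode='bicubic', align_corners=False)
    opt.zero_grad()
    loss = F.mse_loss(model(x_lr), x)             # Dist(x, F_W(D(x)))
    loss.backward()
    opt.step()
```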
60
SRCAE: Inference
[Diagram: given LR image $\hat{x}$ → SRCAE $F_W$ (parameters $W$) → constructed HR image $y$]
$y = F_W(\hat{x})$
gameworks.nvidia.com 61
61
Super-resolution: ill-posed task?
gameworks.nvidia.com 62
62
Distance/Loss function
The distance function is a key element in obtaining good results.
$W = \arg\min_W \sum_i D(x_i, F_W(x_i))$
gameworks.nvidia.com 63
MSE, L2 and L1 metrics will eventually converge to the results shown before, and indeed
we started with MSE, but we obtained better results with another metric.
63
Loss function
MSE (Mean Squared Error): $\frac{1}{N}\|x - F(x)\|^2$
gameworks.nvidia.com 64
Loss function is important. Generally, people use the MSE loss function, which stands
for mean squared error.
64
Loss function
MSE (Mean Squared Error): $\frac{1}{N}\|x - F(x)\|^2$
PSNR (Peak Signal-to-Noise Ratio): $10 \cdot \log_{10}\!\left(\frac{\mathit{MAX}^2}{\mathrm{MSE}}\right)$
gameworks.nvidia.com 65
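Both metrics are a couple of lines of NumPy (a sketch; MAX is 255 for 8-bit images):

```python
import numpy as np

def mse(x, fx):
    # Mean squared error between the original image x and the reconstruction F(x).
    return np.mean((np.asarray(x, dtype=np.float64) - np.asarray(fx, dtype=np.float64)) ** 2)

def psnr(x, fx, max_val=255.0):
    # Peak signal-to-noise ratio, in dB.
    return 10.0 * np.log10(max_val ** 2 / mse(x, fx))
```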
65
Loss function: HFEN
MSE (Mean Squared Error): $\frac{1}{N}\|x - F(x)\|^2$
PSNR (Peak Signal-to-Noise Ratio): $10 \cdot \log_{10}\!\left(\frac{\mathit{MAX}^2}{\mathrm{MSE}}\right)$
HFEN* (High Frequency Error Norm, HP = high-pass filter): $\|\mathrm{HP}(x - F(x))\|^2$
Perceptual loss
* http://ieeexplore.ieee.org/document/5617283/ gameworks.nvidia.com 66
66
Perceptual loss
[Diagram: image $x$ → $G(x)$ → perceptual features]
gameworks.nvidia.com 67
We can generalize this idea. Suppose we have some transformation that extracts
perceptual features.
67
Perceptual loss
[Diagram: image $x$ → $G(x)$ → perceptual features]
Perceptual features:
• High-frequency information: $G(x) = \frac{1}{N}\,\mathrm{HP}(x)$
• CNN features*: $G(x) = \mathrm{VGG}(x)$
• Other
* https://arxiv.org/abs/1603.08155 gameworks.nvidia.com 68
68
Perceptual loss
[Diagram: image $x$ → $G(x)$ → perceptual features]
Perceptual features:
• High-frequency information: $G(x) = \frac{1}{N}\,\mathrm{LoG}(x)$
• CNN features*: $G(x) = \mathrm{VGG}(x)$
• Other
$L = \frac{1}{N}\|x - F(x)\|^2 + \alpha\,\|G(x) - G(F(x))\|^2$
* https://arxiv.org/abs/1603.08155 gameworks.nvidia.com 69
Then, having a perceptual loss focused on some specific component, we can construct
the total loss as a weighted sum of the regular content loss and the perceptual loss,
as sketched below.
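As an illustrative sketch of such a weighted sum (not the exact loss used in the tool), the high-frequency term can be approximated with a fixed Laplacian filter in PyTorch; the filter kernel and the weight alpha are assumptions.

```python
import torch
import torch.nn.functional as F

# 3x3 Laplacian kernel as a simple stand-in for the LoG high-pass filter G(x).
LAP = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)

def high_pass(img):
    # Apply the filter to each channel independently (img: N x C x H x W).
    k = LAP.repeat(img.shape[1], 1, 1, 1).to(img.device)
    return F.conv2d(img, k, padding=1, groups=img.shape[1])

def total_loss(x, fx, alpha=0.5):
    content = F.mse_loss(fx, x)                            # regular content loss
    perceptual = F.mse_loss(high_pass(fx), high_pass(x))   # high-frequency error term
    return content + alpha * perceptual
```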
69
Perceptual loss
[Diagram: image $x$ → $G(x)$ → perceptual features]
Perceptual features:
• High-frequency information: $G(x) = \frac{1}{N}\,\mathrm{LoG}(x)$
• CNN features*: $G(x) = \mathrm{VGG}(x)$
• Other
$L = \frac{1}{N}\|x - F(x)\|^2 + \alpha\,\|G_1(x) - G_1(F(x))\|^2 + \beta\,\|G_2(x) - G_2(F(x))\|^2 + \dots$
* https://arxiv.org/abs/1603.08155 gameworks.nvidia.com 70
70
Regular loss
Result 4x Result 4x
gameworks.nvidia.com 71
71
Regular loss + Perceptual loss
Result 4x Result 4x
gameworks.nvidia.com 72
And here is the upscaling with the perceptual loss. Edges have become sharper, and the
aliasing effect is reduced.
72
Demo
Super-Resolution
73
Generative Adversarial Networks
Generator goal: maximize the error of the Discriminator
Discriminator goal: distinguish generated images from real images
gameworks.nvidia.com 74
74
Super-resolution: GAN-based loss
[Diagram: LR image $x$ → generator $F(x)$; generated and real HR images $y$ → discriminator $D(y)$ → real or generated]
gameworks.nvidia.com 75
Super-resolution is also a generative task. So, let's try to apply GANs to it. As a
generator let's take our super-resolution auto-encoder, and as a discriminator, let's
train a binary classifier, which will distinguish upscaled and real high-resolution
images.
This will alter the loss function of our auto-encoder, and such an additional term can
be considered a special type of perceptual loss.
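A minimal sketch of that idea (architecture and weighting are assumptions, not the shipped implementation): a small convolutional binary classifier acts as the discriminator, and the generator's loss gains an adversarial term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Discriminator: does this image look like a real high-resolution image?
discriminator = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

def generator_loss(x_hr, y_fake, lam=1e-3):
    # Content term plus adversarial term: try to make D call the upscaled image "real".
    content = F.mse_loss(y_fake, x_hr)
    adversarial = F.binary_cross_entropy_with_logits(
        discriminator(y_fake), torch.ones(y_fake.shape[0], 1))
    return content + lam * adversarial

def discriminator_loss(x_hr, y_fake):
    # Classify real HR images as 1 and generated (upscaled) images as 0.
    real = F.binary_cross_entropy_with_logits(
        discriminator(x_hr), torch.ones(x_hr.shape[0], 1))
    fake = F.binary_cross_entropy_with_logits(
        discriminator(y_fake.detach()), torch.zeros(y_fake.shape[0], 1))
    return real + fake
```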
75
Questions?
Marco Foco, Developer Technology Engineer
Dmitry Korobchenko, Deep Learning R&D Engineer
Andrew Edelsten, Senior Developer Technology Manager
76