Image Colorization With Deep Convolutional Neural Networks
1 Introduction
Colorization is the process of introducing hues into black-and-white images or videos.
A large number of historic photographs and videos contain an insufficient amount of
color and luminance information. Colorizing these images helps us recreate those
moments and gain a better perception of older times. Rapid progress in multimedia and
computer technology has resulted in a rapid increase in the use of digital images. The
rich knowledge contained in this data serves a wide range of uses, including crime
prevention, the military, home entertainment, education, cultural heritage, and medical
diagnosis. Making efficient use of this information and exploring and analyzing the vast
volume of image data is a very challenging task. Since the human visual system can
interpret hue information more accurately than monochrome information, colorization
can add value to monochrome images and TV programs. Up until now, colorization has
mostly been done manually in Photoshop, which is both ineffective and time-consuming.
With the help of deep learning and convolutional neural networks (CNNs), this
process can be automated by providing more information and context about the image.
CNNs are deep artificial networks primarily used for object classification and
recognition. The layered architecture of a CNN helps in recognizing object patterns:
the deeper we move into the network, the more patterns are detected and the more
objects are recognized [1]. Many methods have successfully colorized grayscale
images. Such approaches can be loosely categorized into two groups: those in which an
individual pixel is assigned a color based on its brightness, as derived from a color
image of relevant content, and those in which the image is segmented into regions,
each of which is then allocated a single color. This paper follows the former
approach: we evaluate the color content of the training images and then try to predict
a colorized version on a per-pixel basis for a single target grayscale image. While
segmentation methods have an inherent appeal and could result in more coherent
colorization, they depend on accurate image segmentation, which can be thrown off by
shadows, lighting effects, or color gradients. Furthermore, their training color sets
require manual identification of the objects in a scene.
One of the major problems in colorization is that two objects with different colors
may appear to have the same intensity in grayscale. One simple solution to this
problem is to ask the user for color inputs. However, doing so makes the solution very
tedious, as it requires the user to repaint all the segments of the image. Most
automatic colorization techniques are also computationally expensive. The objective of
our approach is to efficiently automate the process by training a machine learning
model on a relatively small corpus of training images. For this work, we operate in
the LAB color space instead of RGB, where L stands for luminance and A and B for
chrominance. Hence, the input to our algorithm is the grayscale image, which is the L
channel, and the output is the A and B channels.
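As an illustration, the following minimal sketch shows how such L/AB training pairs could be prepared; the use of skimage's rgb2lab routine and the particular scaling constants are our assumptions, not details given in the paper.

```python
# Minimal sketch (assumption): convert an RGB image to LAB and split it into
# the network input (L) and the prediction targets (a, b).
import numpy as np
from skimage import color

def split_lab(rgb):
    """rgb: float array in [0, 1] with shape (H, W, 3)."""
    lab = color.rgb2lab(rgb)            # L in [0, 100], a/b roughly in [-128, 128]
    L = lab[:, :, 0:1] / 100.0          # luminance input, scaled to [0, 1]
    ab = lab[:, :, 1:] / 128.0          # chrominance targets, scaled to [-1, 1]
    return L, ab
```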
This paper is structured as follows: Sect. 1 discusses the history of CNNs and
their importance to colorization. Section 2 provides a comprehensive review of the
related work in this field. The proposed work is explained in detail in Sect. 3: first,
we introduce the Alpha version to colorize grayscale images; then we enhance it into
the Beta version by adding a feature extractor; and in our final version, we load the
weights from the Inception-ResNet model to boost our image classifier. Section 4
presents the simulation results, followed by the conclusion and possible future
directions in Sect. 5.
2 Literature Survey
A number of researchers have worked in this field to provide effective solutions
for colorizing grayscale images. Welsh et al. [2] suggested a semi-automatic process
for shading images using reference images and transferring luminance-based color
variations between them. This method matches the luminance values of neighboring
pixels in the target picture and fills in colors from the corresponding location in the
reference image, but the user must find a reference image containing colors in the
desired regions. Matching quality is enhanced by taking advantage of the qualities and
structure of the luminance of the surrounding pixels. These color-transfer strategies
provide adequate coloring results, provided that the input image has distinct
luminance values or textures around the edges of the objects. An alternative solution
is to let the user assign colors to certain pixels and propagate those colors to the
remaining pixels. The propagation problem is formulated as minimizing a quadratic cost
function under the constraint that neighboring pixels with similar intensity should
have similar colors [3]. Here, the user chooses colors directly and is able to refine
the results by scribbling more colors onto the image. This methodology was
particularly helpful in colorizing animation and cartoon movies.
In [4], the colors of the reference pixels are blended to colorize the destination
pixel based on the geodesic distance from the source pixels to the destination pixel.
The geodesic distance measures the variation of the luminance from a reference pixel to
the destination pixel along the path between them. However, such propagation-based
schemes can produce color-bleeding artifacts, and their performance is significantly
affected by the location of the color sources. Deshpande et al. [5] formulated
colorization as a linear problem.
Another significant approach is used in [6], where the user is asked to provide color
values at specific pixels, which are then used to colorize the image. Although
effective, this is a very tedious task, as it asks for the user's help again and again.
Image segmentation is an intriguing approach, but it can lead to erroneous results when
dealing with background clutter, dissimilar patterns, and non-identical hues. The task
of colorizing images can be automated using machine learning [7]. Automating this
process not only reduces computational cost and time but is also known to reduce the
error rate by half. Inspired by the above-mentioned issues related to colorization and
the benefits of machine learning, we design a system that combines a deep convolutional
neural network trained from the ground up with high-level features extracted from a
pre-trained Inception-ResNet-v2 model. Our approach aims to efficiently transfer the
machine learning output to colorize the image. This is done by training our model on a
small training dataset, thereby reducing the computational cost of the system.
3 Proposed Models
The Alpha version is the most basic version of our neural network; it effectively
colorizes trained images (Fig. 1). First, we change the color channels from RGB to LAB
using a pre-defined algorithm. L holds the brightness, while a and b hold the red-green
and blue-yellow color components. The LAB-encoded image thus has one grayscale layer,
and the three color layers are packed into two layers. This implies that we can reuse
the original grayscale picture in our final prediction. Convolutional filters are used
between the input and output values to tie them together in a convolutional neural
network [8, 9]. Each filter extracts some of the information from the picture. The
network can either make a new picture from a single filter or combine a range of
filters into a single picture. In the case of a convolutional neural network, each
filter is adjusted automatically to help produce the expected result. Hundreds of
filters are then stacked and narrowed down to two layers, namely a and b. The neural
network then predicts values based on the grayscale input, and the predicted values are
compared with the real values. The real color values range from −128 to 128, which is
the default interval for the LAB color space. We use the tanh activation function to
map the predicted values: given any input value, tanh returns a value between −1 and 1.
This allows us to compute our prediction error. The network keeps revising its filters
to reduce the final error and keeps iterating until the error is minimal. The
convolutional network then produces a colorized version based on the images it has
learned from in the past.
Fig. 1 Illustration of a convolutional filter: a 5 × 5 × 3 filter applied to a 32 × 32 × 3 input image
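To make the Alpha version concrete, the sketch below shows one possible Keras network in this spirit: a stack of 3 × 3 convolutions mapping the L channel to two tanh-activated a/b channels, trained with RMSProp and MSE as listed in Sect. 4. The number of layers and filters is our assumption.

```python
# Sketch of an Alpha-style colorization network (layer/filter counts are assumptions).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_alpha_model(height=256, width=256):
    inputs = layers.Input(shape=(height, width, 1))                           # grayscale L channel
    x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(inputs)
    x = layers.Conv2D(16, (3, 3), activation='relu', padding='same')(x)
    x = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(x)
    outputs = layers.Conv2D(2, (3, 3), activation='tanh', padding='same')(x)  # a, b in [-1, 1]
    return models.Model(inputs, outputs)

model = build_alpha_model()
model.compile(optimizer='rmsprop', loss='mse')   # RMSProp and MSE, as listed in Sect. 4
```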
In Alpha colorization, we were able to colorize an image that the network had been
trained on. However, for images it has not been trained on, it shows poor results. In
the development of the Beta version, the objective is to generalize the network and
teach it to color images it has not seen before. For this purpose, a feature extractor
[10] is used, which finds the link between grayscale images and their colored versions.
First, simple patterns such as a diagonal line, all-black pixels, and other similar
patterns are considered. In each square (filter), we look for matching patterns and
remove the pixels that do not match. By doing this, we generate as many images as there
are filters. Scanning those images again yields the patterns already detected by the
network, and reducing the image size also helps in detecting patterns. With these
filters, a nine-pixel (3 × 3 filter) layer is obtained. Combining these with the
low-level filters, we are able to detect more complex patterns. Many shapes such as a
half circle, a small dot, or a line are formed, generating 128 new filtered images.
First, low-level features such as edges and curves are formed; these are then combined
into patterns and details, and eventually result in the output.
Our network operates in a trial-and-error manner. Initially, a random estimate is
made for each pixel. Based on the error computed for each pixel, it works backwards
through the network to improve the feature extraction. In this case, the largest source
of error lies in locating the pixels and deciding whether or not to color them. The
network then colors all objects brown, because this color is the most similar to all
other colors and therefore produces the smallest error. Since most of the training
dataset is similar, the network struggles to deal with different objects: the colorized
version contains mostly tones of brown and fails to generate more nuanced colors, as
shown in Fig. 2. The major difference from typical vision networks is the significance
of pixel positioning. For colorizing networks, the width and height of the picture
remain the same across the network. Colorization does not cause any distortion in the
final image, whereas in traditional networks the image becomes more blurred the closer
it moves to the final layer.
In classification networks, the max-pooling layers increase the density of
information but also distort the image. They value only the information, not the
image's layout. Instead, we use a stride of 2 in these networks to reduce the width and
height by half. This increases the density of information but does not distort the
image. Classification networks care only about the final classification, so they keep
reducing the size and quality of the image as it traverses the network. Here, the
aspect ratio of the image is maintained by appending white padding (see the sketch
below); otherwise, each convolutional layer would crop the image.
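The snippet below illustrates this design choice under the same assumptions: a stride-2 convolution with 'same' padding halves the width and height while keeping the spatial layout intact, which is what the colorizing network relies on instead of max-pooling.

```python
# Downsampling with a stride-2 convolution and 'same' padding: the spatial size
# is halved but the layout of the image is preserved (no max-pooling).
from tensorflow.keras import layers

x = layers.Input(shape=(256, 256, 64))
downsampled = layers.Conv2D(128, (3, 3), strides=2, padding='same',
                            activation='relu')(x)   # output: (128, 128, 128)
```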
Fig. 3 Overview of the final network: the original grayscale image is scaled to 224 × 224 for the encoder and to 299 × 299 for Inception-ResNet-v2; the encoder output and the Inception embedding meet in the fusion layer, after which the decoder and a 2 × 2 upsampler produce the colorized image
The final colorization neural network is depicted in Fig. 3 and is built of four main
components: an encoder, a decoder, a fusion layer in between, and a classifier running
in parallel. The classification features are extracted from Inception-ResNet-v2 [11,
12] and fused with the output of the encoder. Our deep CNN is built from scratch,
whereas the classifier is a pre-trained model. The classifier thus helps in identifying
objects in the picture and matching the object representation to the corresponding
color scheme. We use the variant with the fusion layer, which is easy to understand and
reproduce in Keras. While the deep CNN is trained from scratch, Inception-ResNet-v2 is
used as a high-level feature extractor that provides information about the image
content, which helps colorize it. The pictures considered in the CIE L*a*b* color space
are of size H × W.
Starting from the luminance component $X_L \in \mathbb{R}^{H \times W \times 1}$, our
model aims to estimate the remaining components in order to produce a fully colored
version $\tilde{X} \in \mathbb{R}^{H \times W \times 3}$. In short, we assume that
there is a mapping $F$ such that

$$F: X_L \rightarrow (\tilde{X}_a, \tilde{X}_b), \qquad (1)$$

where $\tilde{X}_a, \tilde{X}_b$ are the a*, b* components of the reconstructed image,
which, combined with the input, give the approximate colored image
$\tilde{X} = (X_L, \tilde{X}_a, \tilde{X}_b)$. The proposed architecture is based
entirely on CNNs, an effective model that has been used extensively in the literature,
and is independent of the size of the input. In short, a convolutional layer is a set
of small, learnable filters that fit specific local patterns in the input image. Layers
near the input look for simple patterns such as contours, while those closer to the
output extract more complex characteristics.
We select the CIE L*a*b* color space to represent the input images as it separates
the color characteristics from the luminance, which carries the main image
characteristics. Combining the luminance with the predicted color components guarantees
that the final reconstructed picture has a high level of detail. The model estimates
the a*b* components given the luminance component of an image and combines them with
the input to obtain the final approximation of the colored image. Instead of training a
feature extraction branch from scratch, we use an Inception-ResNet-v2 network (referred
to as Inception) and obtain an embedding of the grayscale image from its last layer.
The network is logically divided into four major components. The encoding and feature
extraction components obtain mid-level and high-level features, respectively, which are
then merged in the fusion layer. Lastly, the decoder exploits these features to
estimate the output. The function of each component is described below, after a short
sketch of the feature-extraction branch.
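As a sketch of the feature-extraction branch, the code below obtains a global embedding of a grayscale image from a pre-trained Inception-ResNet-v2. Using Keras' ImageNet weights with global average pooling (a 1536-dimensional vector) is our assumption; the paper only states that the embedding is taken from the network's last layer.

```python
# Sketch (assumptions noted above): embed a grayscale image with Inception-ResNet-v2.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import InceptionResNetV2
from tensorflow.keras.applications.inception_resnet_v2 import preprocess_input

feature_extractor = InceptionResNetV2(weights='imagenet', include_top=False, pooling='avg')

def inception_embedding(gray):
    """gray: float array in [0, 1] with shape (H, W, 1)."""
    rgb = np.repeat(gray, 3, axis=-1)                    # stack L into three channels
    rgb = tf.image.resize(rgb, (299, 299)).numpy()       # Inception input size
    batch = preprocess_input(rgb[np.newaxis] * 255.0)    # rescale to the expected [-1, 1] range
    return feature_extractor.predict(batch)[0]           # embedding vector (length 1536 here)
```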
Preprocessing The pixel values of all three image components are centered and scaled
depending on their respective ranges, so that the values lie within the interval
[−1, 1] in order to ensure correct learning.
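A minimal sketch of this normalization, assuming the standard ranges L ∈ [0, 100] and a, b ∈ [−128, 128]:

```python
# Centre and scale each LAB component to [-1, 1] (range constants are assumptions).
def normalize_lab(lab):
    L = lab[..., 0:1] / 50.0 - 1.0    # [0, 100]    -> [-1, 1]
    ab = lab[..., 1:] / 128.0         # [-128, 128] -> [-1, 1]
    return L, ab
```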
Encoder The encoder processes H × W grayscale images and outputs a
H/8 × W/8 × 512 feature representation. For this purpose, it uses 8 convolutional
layers with 3 × 3 kernels. Padding is used to preserve the layer's input size. In
addition, the first, third, and fifth layers use a stride of 2, each halving the
dimension of their output and thus reducing the total number of computations required.
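A possible Keras realization of this encoder is sketched below; the description fixes the kernel size, the strided layers, and the output depth, while the intermediate filter counts are our assumptions.

```python
# Encoder sketch: eight 3x3 convolutions with 'same' padding, stride 2 in the
# first, third and fifth layers, ending at H/8 x W/8 x 512.
# Intermediate filter counts are assumptions.
from tensorflow.keras import layers, models

def build_encoder(height=224, width=224):
    inputs = layers.Input(shape=(height, width, 1))
    x = layers.Conv2D(64, (3, 3), strides=2, padding='same', activation='relu')(inputs)
    x = layers.Conv2D(128, (3, 3), padding='same', activation='relu')(x)
    x = layers.Conv2D(128, (3, 3), strides=2, padding='same', activation='relu')(x)
    x = layers.Conv2D(256, (3, 3), padding='same', activation='relu')(x)
    x = layers.Conv2D(256, (3, 3), strides=2, padding='same', activation='relu')(x)
    x = layers.Conv2D(512, (3, 3), padding='same', activation='relu')(x)
    x = layers.Conv2D(512, (3, 3), padding='same', activation='relu')(x)
    x = layers.Conv2D(512, (3, 3), padding='same', activation='relu')(x)   # H/8 x W/8 x 512
    return models.Model(inputs, x)
```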
Fusion The fusion layer takes the feature vector from Inception, replicates it
H/8 × W/8 times, and attaches it to the encoder's feature volume along the depth axis.
This yields a single volume containing the encoded image and the mid-level
characteristics, of shape H/8 × W/8 × 1257. By mirroring and concatenating the feature
vector multiple times, we ensure that the semantic information conveyed by the feature
vector is uniformly distributed across the entire spatial region of the image. In
addition, this approach makes the model robust to arbitrary input image sizes,
increasing its versatility. Finally, we apply 256 convolution kernels of size 1 × 1,
effectively producing a feature volume of dimension H/8 × W/8 × 256.
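The fusion step could look like the following sketch: the Inception embedding is tiled over every spatial position of the encoder output, concatenated along the depth axis, and reduced with 256 1 × 1 convolutions. The embedding length is left as a free dimension here; the paper's reported fused depth of 1257 channels corresponds to its particular configuration.

```python
# Fusion-layer sketch: tile the Inception embedding over the H/8 x W/8 grid,
# concatenate with the encoder features along depth, then reduce with 1x1 convolutions.
from tensorflow.keras import layers

def fuse(encoder_out, embedding):
    """encoder_out: (batch, H/8, W/8, D) tensor; embedding: (batch, E) tensor."""
    h, w = encoder_out.shape[1], encoder_out.shape[2]
    e = layers.RepeatVector(h * w)(embedding)                 # (batch, H/8 * W/8, E)
    e = layers.Reshape((h, w, embedding.shape[-1]))(e)        # (batch, H/8, W/8, E)
    fused = layers.Concatenate(axis=-1)([encoder_out, e])     # depth = D + E
    return layers.Conv2D(256, (1, 1), activation='relu', padding='same')(fused)
```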
The optimal model parameters are determined by minimizing an objective function
defined over the estimated output and the target output. To measure the model loss, we
use the mean squared error between the predicted pixel colors in a*b* space and their
actual values. For a particular picture X, the mean squared error is given by
$$C(X, \theta) = \frac{1}{2HW} \sum_{k \in \{a,b\}} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( X_{k_{i,j}} - \tilde{X}_{k_{i,j}} \right)^{2}$$
where $\theta$ represents all model parameters, and $X_{k_{i,j}}$ and
$\tilde{X}_{k_{i,j}}$ denote the (i, j)-th pixel value of the k-th component of the
target and the reconstructed image, respectively. This can easily be extended to a
batch B by averaging the cost over all images in the batch, i.e.,
$\frac{1}{|B|} \sum_{X \in B} C(X, \theta)$. During learning, this loss is
back-propagated to update the model parameters using the Adam optimizer with an initial
learning rate of η = 0.001. We enforce a fixed image size during training to allow
batch processing.
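A minimal training sketch consistent with this objective is given below; the batch size and epoch count are our assumptions, while the Adam learning rate of 0.001 and the MSE loss follow the text.

```python
# Compile and train a colorization model with MSE loss and Adam (lr = 0.001).
import tensorflow as tf

def compile_and_train(model, L_train, ab_train, batch_size=32, epochs=500):
    """model maps L inputs to (a, b) outputs; L_train/ab_train are preprocessed tensors."""
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss='mse')
    return model.fit(L_train, ab_train, batch_size=batch_size, epochs=epochs)
```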
4 Simulation Results
The proposed architecture is implemented in Keras with a TensorFlow backend (using
Python 3). The implementation details for the CNN are as follows:
• Model: convolutional neural network
• Dataset: public dataset imported from Unsplash
• Optimizer: RMSProp (Alpha and Beta versions) and Adam (full version)
• Activation function: ReLU
• Loss: MSE
• Colorspace: LAB
• Layers: kernels of size 2 × 2 with a stride of 2 (Beta version) and kernels of size
1 × 1 (full version)
The output for trained and untrained images with the Alpha model is illustrated in
Fig. 4, which shows that when colorizing trained images, the Alpha version successfully
recreates the exact colors of the original image.
Fig. 4 Alpha model results on trained and untrained images: grayscale input, outputs after 10, 100, and 500 epochs, and the original image for comparison
The output for trained and untrained images with the Beta model is illustrated in
Fig. 5. For trained images, the Beta version correctly colorizes the image even though
the input given to the network was black and white.
5 Conclusion
This paper implements a deep convolutional neural architecture suitable for
colorizing black-and-white images. The proposed model has successfully colorized
black-and-white images up to a perceptive ability of around 80%. However, there is
still room for improvement in coloring images with fine details and background clutter.
Fig. 5 Beta model results on trained and untrained images: grayscale input and outputs after 250 and 1000 epochs
For future work, we would recommend running this model on other pre-trained models and
different datasets. It would also be interesting to see its application to video
segments. Various changes and hybridizations of technologies in neural network
architectures remain a field to be explored.
References
8. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image
recognition. CoRR abs/1409.1556
9. Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In:
European conference on computer vision, Springer, pp 818–833
10. Iizuka S, Simo-Serra E, Ishikawa H (2016) Let there be color!: joint end-to-end learning of
global and local image priors for automatic image colorization with simultaneous classification.
ACM Trans Graph (TOG) 35(4):110
11. Szegedy C, Ioffe S, Vanhoucke V, Alemi AA (2017) Inception-v4, inception-resnet and the
impact of residual connections on learning. In: AAAI, vol. 4, p 12
12. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778