Abstract
In video streaming, bandwidth constraints significantly affect client-side video quality. Addressing this, deep neural networks offer a promising avenue for implementing video super-resolution (VSR) at the user end, leveraging advancements in modern hardware, including mobile devices. The principal challenge in VSR is the computational intensity involved in processing temporal/spatial video data. Conventional methods, uniformly processing entire scenes, often result in inefficient resource allocation. This is evident in the over-processing of simpler regions and insufficient attention to complex regions, leading to edge artifacts in merged regions. Our innovative approach employs semantic segmentation and spatial frequency-based categorization to divide each video frame into regions of varying complexity: simple, medium, and complex. These are then processed through an efficient incremental model, optimizing computational resources. A key innovation is the sparse temporal/spatial feature transformation layer, which mitigates edge artifacts and ensures seamless integration of regional features, enhancing the naturalness of the super-resolution outcome. Experimental results demonstrate that our method significantly boosts VSR efficiency while maintaining effectiveness. This marks a notable advancement in streaming video technology, optimizing video quality with reduced computational demands. This approach, featuring semantic segmentation, spatial frequency analysis, and an incremental network structure, represents a substantial improvement over traditional VSR methodologies, addressing the core challenges of efficiency and quality in high-resolution video streaming.
1 Introduction
The widespread adoption of streaming video transmission technology has significantly impacted our daily lives, as its applications continue to expand [10, 13]. Alongside this growth, there has been an escalating demand for high-quality video content [4, 9]. This demand is driven by a variety of factors, including advancements in display technologies and user expectations for more immersive and detailed visual experiences [13]. As a result, the field of video super-resolution (VSR) has gained considerable attention in research circles. The primary aim of VSR is to transform low-resolution (LR) videos into high-resolution (HR) ones, thereby substantially improving the quality of the visual content delivered to end-users.
A critical issue, particularly in computational efficiency, arises from the conventional approach of VSR research, which primarily concentrates on uniformly processing entire scenes. The majority of VSR approaches involve a holistic processing strategy, where the entire feature map is treated as a unified entity with uniform convolutional kernels and parameters [28]. However, this method fails to consider the substantial variations in texture, structure, and semantic information across distinct regions of the feature map. As a result, the inherent differences in texture complexity across various regions of the frames are overlooked, and regions with less complexity receive the same level of processing as more complex regions, leading to over-processing. Such a uniform approach not only increases computational demands but also potentially compromises the quality of VSR. The need for more refined strategies in VSR is evident, where selective processing based on the complexity and characteristics of different frame regions can optimize both the computational efficiency and the quality of the super-resolved output.
A more efficient design should discern and process different regions of a frame selectively, tailoring the application of neural network enhancements to match the complexity of each region. This nuanced processing, as exemplified by methods like those discussed in [7], involves evaluating and treating each frame patch individually based on its specific textural and structural attributes. Such a refined approach not only promises a significant reduction in computational overhead but also augments the overall efficacy of VSR. Moving away from uniform treatment of frame features ensures optimized visual quality with efficient resource utilization, thereby addressing the core challenges of current VSR methodologies.
Depending on the richness of details, frame content can be categorized into smooth regions and texture and edge regions [8]. Patch-based methods are prone to splitting similar texture features across different patches. Because patches differ in complexity, contiguous features receive inconsistent computational resources, so the edges of the patches are prone to varying degrees of artifacts and blurring, and the same objects may be processed differently across patches. Such inconsistencies can cause visual discomfort due to the uneven processing of similar textures and details, and they are especially prone to artifacts where distinct regions meet. These challenges not only affect the efficiency of VSR algorithms but also degrade the overall quality of the super-resolution results.
Our approach to solving the problem is straightforward: we classify based on the complexity of different instances and backgrounds within the frame, and then selectively process regions of varying complexities using neural networks of corresponding capabilities. Specifically, we employ LPS-Net [30] for semantic segmentation of each frame, followed by the use of defined spatial frequencies to analyze the complexity of region textures. The semantically segmented regions are categorized into three levels based on their complexity. To reduce the overall parameters of the neural network, we have developed an incremental network structure. This structure ensures that simple regions are processed using Network A, medium-complex regions with Network A+B, and highly complex regions with Network A+B+C. As these regions are non-overlapping, each complexity-level neural network processes distinct sparse data (i.e., sparse feature maps). Consequently, we introduce sparse convolution to avoid processing in sparse regions (where feature values are zero).
Artifacts at the edges are a common challenge when merging regions generated by distinct neural networks [28]. Our initial approach used simple summation with overlapping outputs, configured through coefficient convolution, but this led to blurring at the edges. To resolve this, we introduce a novel adaptive feature fusion structure named sparse-TSFT (sparse temporal/spatial feature transformation layer). Sparse-TSFT differs from traditional feature concatenation by using a convolutional layer to adaptively fuse local and global features. This method effectively integrates sparse and global features, enhancing both the naturalness and integrity of the super-resolution output and accurately emphasizing regions with rich features.
Here, we summarize the main contributions as follows:
- We introduce a novel, lightweight incremental VSR network that incorporates spatiotemporal sparse convolution alongside conditional feature fusion, aiming for efficiency and effectiveness.
- Our approach applies semantic segmentation to differentiate and classify regions within video frames based on their complexity. These classifications are then inputted into the network as masks, which aid in the gradual integration of both global and regional features.
- The network utilizes spatiotemporal sparse convolution to process features from regions with diverse complexities, as defined by the segmentation masks. This method is designed to enhance the model’s efficiency in VSR tasks.
- We have devised the sparse temporal/spatial feature transformation layer (sparse-TSFT), a feature fusion layer that integrates regional features, acting as conditions, with global features. This integration is geared toward reducing the edge artifacts, thus striving for improved fusion results in the final output.
2 Related work
2.1 Video super-resolution
A proliferation of research focusing on deep learning-based VSR has emerged, owing to the strong representation and fitting capabilities of deep networks [12, 17, 20, 23, 25, 26, 29]. Liu et al. [12] proposed a temporal adaptive neural network and a spatial alignment network, which can adapt to the time dependency and achieve higher robustness by reducing the motion complexity between neighboring frames. Tao et al. [17] designed a unique sub-pixel motion compensation layer, which can fully utilize the multi-frame correlation information and efficiently fuse the image details into high-resolution images by adapting to the time dependency. Xue et al. [25] proposed task-oriented optical flow training, a self-supervised training method that jointly trains the whole network to learn the most suitable optical flow to express features for a certain task. Apart from this, Xu et al. [24] presented a novel implicit resampling-based alignment method that leverages coordinate networks and window-based cross-attention, significantly enhancing alignment accuracy and the preservation of high-frequency details. Moreover, some more advanced methods have been proposed. Tian et al. [18] proposed a temporal deformable alignment network (TDAN), which adaptively aligns the reference frame and each support frame at the feature level to alleviate occlusions and artifacts during reconstruction without computing optical flow. EDVR [20] extends TDAN by utilizing a pyramid structure for coarse-to-fine deformable alignment and a novel spatiotemporal attentional fusion module to focus more on large-scale motion variations. BasicVSR++ [2] extends the VSR paradigm by introducing recurrent neural networks alongside novel second-order grid propagation and flow-guided deformable alignment, significantly enhancing the capture and utilization of temporal information for improved resolution and detail fidelity across video frames. Furthermore, incorporating the latest transformer methodologies, Liang et al. [11] proposed a novel approach that combines the benefits of recurrent processing with transformers, utilizing guided deformable attention for enhanced alignment in video restoration tasks. Our work introduces an incremental structure that processes features in stages, optimizing SR performance at the smallest possible computational cost.
2.2 Semantic guidance
Semantic guidance is widely applied in various fields such as natural language processing, computer vision, and machine learning. In computer vision, using semantic segmentation as an input condition to generate natural images can help image generation systems better understand the content of the images. Zhu et al. [31] proposed a two-step method to generate new clothes on a wearer: it first generates a blurry semantic segmentation map and then designs a generative model with a composition mapping layer to conditionally generate the final image with precise regions and textures. Ren et al. [16] employed semantic segmentation with distinct motion patterns for different object layers, significantly enhancing the accuracy of optical flow estimation at object boundaries and thus achieving video deblurring. Gatys et al. [5] applied semantic mapping to control perceptual factors in neural style transfer. Wang et al. [21] introduced semantic maps to guide texture restoration in different regions of the image; additionally, their method employs probability maps to capture fine texture differences. Previously, there were few studies applying semantic guidance in the field of video super-resolution. Our work combines the two and selectively utilizes semantic information through the idea of sparse convolution, making video super-resolution more efficient.
3 Motivation
Traditional VSR methods treat a single frame as a whole, overlooking variations in intra-frame information density, leading to significant computational redundancy. Inspired by [14], our approach maximizes SR efficiency by advocating a partition-based processing strategy that acknowledges the diversity of content within individual frames. The core concept of our design is to allocate computational resources based on the complexity of details in different regions within a frame, reducing operational demands while preserving the integrity of the super-resolution output. Our proposed lightweight incremental VSR network, equipped with 3D sparse convolution and sparse-TSFT, opens up a new paradigm for efficient lightweight VSR structure.
We have devised a method for pre-processing video frames that intelligently partitions and labels the data, drawing inspiration from SEEM [14] and ClassSR [7]. SEEM introduces a plug-and-play model that achieves better feature alignment and fusion through the use of semantic information guidance. In contrast, ClassSR creatively segments video data into uniformly sized rectangular patches and allocates super-resolution model computing power based on the texture complexity within each patch. Therefore, we introduce semantic segmentation as a means of partitioning video frames, classifying regions into simple, medium-complex, and highly complex categories based on texture complexity, thereby facilitating video frame pre-processing.
The outputs from the segmentation of a given frame are inherently sparse. Ordinary convolution wastes computation when processing these sparse areas. Therefore, we utilize 3D sparse convolution to avoid processing the zeros in these areas.
An incremental VSR architecture is used to process and aggregate the segmented outputs of different complexities. The whole feature extraction part consists of a main backbone path and two incremental feature processing branches. The backbone path handles global features, while the branches process features from more complex regions and supplement them into the global feature map.
Due to the need to aggregate the feature map segments of different branch complexity for each frame, inspired by SFT [21], we incorporate a sparse-TSFT module into the network. While SFT originally used segmentation probability maps as prior information to construct a mapping function with global features, in the design of the sparse-TSFT module, we deconstruct the fusion problem into a feature mapping problem. We map regional features as prior information into global features, addressing potential issues of edge blurring or artifacts during feature fusion.
4 Methodology
In this section, we first delineate our approach to data pre-processing, specifically leveraging video semantic segmentation algorithms and video complexity detection methods to generate masks that differentiate between areas of varying complexity.
In a typical natural video, regions such as the sky are often smoother and lack texture and detail, with relatively gradual changes between successive frames. In contrast, areas like shrubbery and branches exhibit richer textures, with more pronounced changes from one frame to the next. To efficiently allocate computational resources to appropriate regions, we employ video semantic segmentation algorithms to partition different categories of areas. We utilize LPS-Net as our primary algorithm for video semantic segmentation, which offers the advantages of high accuracy and low computational load. This process is illustrated as follows:
Let I be a frame in the video. \({\mathrm {Seg(I)}} = \{{\mathcal {R}}_1, {\mathcal {R}}_2, \ldots , {\mathcal {R}}_N\}\) represents the set of regions obtained after performing semantic segmentation on I, with each region denoted as \({\mathcal {R}}_i\) and N representing the total number of regions.
Following the acquisition of the segmented image, we calculate the texture complexity for different categories of regions. Based on the average complexity, a classification map is generated to determine which regions require focused restoration and which regions do not necessitate substantial computational effort. The classification operation for video segmentation data is represented as follows:
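For a region of size \(M \times N\), a standard spatial-frequency formulation consistent with the description below is (the \(1/MN\) normalization and the square roots follow the conventional definition of spatial frequency):

$$\begin{aligned} \mathrm {RF}&= \sqrt{\frac{1}{MN}\sum _{i=1}^{M}\sum _{j=2}^{N}\left( {\mathcal {I}}_{i,j}-{\mathcal {I}}_{i,j-1}\right) ^{2}}, \\ \mathrm {CF}&= \sqrt{\frac{1}{MN}\sum _{j=1}^{N}\sum _{i=2}^{M}\left( {\mathcal {I}}_{i,j}-{\mathcal {I}}_{i-1,j}\right) ^{2}}, \\ \mathrm {SF}&= \sqrt{\mathrm {RF}^{2}+\mathrm {CF}^{2}}. \end{aligned}$$
(2)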
In Eq. 2, \({\mathcal {I}}_{i,j}\) represents the video frame, where i and j denote the row and column indices of \({\mathcal {I}}\), respectively. RF stands for row frequency, which calculates the sum of squared differences of pixel intensities within each row. CF represents column frequency, calculating the sum of squared differences of pixel intensities within each column. SF combines the row-frequency and column-frequency components, representing the total spatial frequency. The spatial frequencies of all regions are then subjected to K-means clustering.
Let \(K = 3\), with the clustering centers, denoted as \(C_1\), \(C_2\), \(C_3\), each representing a different level of texture complexity. Based on the results of the K-means clustering, each region \({\mathcal {R}}_i\) is assigned a category label \(L_i\), determined by the proximity of its spatial frequency to the clustering centers. A classification map G is then generated, where each region \({\mathcal {R}}_i\) is colored according to its category label \(L_i\). This can be represented as the mapping \(G: {\mathcal {R}}_i \mapsto L_i\).
Subsequently, utilizing the generated classification map, we create the medium complexity mask \(M_\textrm{med}\) and the complex mask \(M_\textrm{cplx}\). Their generation is as follows:
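Consistent with the description below, the two masks can be written as indicator functions over the classified regions:

$$\begin{aligned} M_\textrm{med}(x) = {\left\{ \begin{array}{ll} 0, &{} \text {if } x \in {\mathcal {R}}_{\text {sim}},\\ 1, &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
(8)

$$\begin{aligned} M_\textrm{cplx}(x) = {\left\{ \begin{array}{ll} 1, &{} \text {if } x \in {\mathcal {R}}_{\text {cplx}},\\ 0, &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(9)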
In Eqs. 8 and 9, \({\mathcal {R}}_{\text {sim}}\) and \({\mathcal {R}}_{\text {cplx}}\) represent the simple and complex areas in the classification map, respectively. \(M_\textrm{med}\) sets the medium and complex areas to 1 and all other areas to 0: Eq. 8 states that if pixel x is not in the medium or complex areas, \(M_\textrm{med}\) is 0; otherwise, it is 1. \(M_\textrm{cplx}\) sets the complex areas to 1 and all other areas to 0: Eq. 9 states that if pixel x is not in the complex area, \(M_\textrm{cplx}\) is 0; otherwise, it is 1.
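As an illustration of this pre-processing step, the following is a minimal sketch rather than the released implementation: here, regions is assumed to be the list of boolean region masks produced by LPS-Net, and spatial_frequency is a hypothetical helper implementing the SF measure of Eq. 2 restricted to one region.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_masks(frame, regions, spatial_frequency):
    """Cluster region spatial frequencies into 3 levels and build M_med / M_cplx."""
    # One SF value per segmented region R_i.
    sf = np.array([spatial_frequency(frame, r) for r in regions]).reshape(-1, 1)

    # K = 3 clusters; order the centres so that low SF maps to "simple".
    km = KMeans(n_clusters=3, n_init=10).fit(sf)
    order = np.argsort(km.cluster_centers_.ravel())
    level = {c: rank for rank, c in enumerate(order)}   # 0/1/2 = simple/medium/complex

    m_med = np.zeros(frame.shape[:2], dtype=np.float32)
    m_cplx = np.zeros(frame.shape[:2], dtype=np.float32)
    for r, lab in zip(regions, km.labels_):
        if level[lab] >= 1:        # medium or complex region -> 1 in M_med
            m_med[r] = 1.0
        if level[lab] == 2:        # complex region -> 1 in M_cplx
            m_cplx[r] = 1.0
    return m_med, m_cplx
```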
Fig. 1 The overall framework of the proposed network. It is an incremental structure in which two 3D sparse Resblock modules are used for feature extraction in the “medium” and “complex” regions, respectively, where the bottom branch is for the “complex” regions. The extracted regional feature maps are then input into sparse-TSFT to guide the feature fusion of the complete frame feature maps output by the 3D Resblock blocks
4.1 Backbone
The overall framework of the model is illustrated in Fig. 1. The video super-resolution network comprises a backbone network and incremental feature fusion branches that handle the medium and complex levels of complexity. The backbone network consists of multiple 3D convolutional residual blocks, primarily responsible for extracting global features of the video across the spatial-temporal dimensions to ensure the integrity of the overall contour and structure of the super-resolved image. These 3D convolutional residual blocks are composed of two 3D convolutional blocks interspersed with a ReLU activation layer and flanked by skip connections at both ends. The incremental feature fusion layers mainly deal with more complex local features. As shown in Fig. 1, the global features are fed into the incremental feature fusion branches. Here, a mask is input to determine areas of feature complexity and to sparsify the local spatial-temporal features; the process is shown in Fig. 3. These features are then processed through three layers of 3D sparse convolutional residual blocks to extract deeper local features. The 3D sparse convolutional residual blocks, similar to the 3D convolutional residual blocks, consist of two 3D sparse convolution blocks coupled with a ReLU activation layer and are also flanked by skip connections. The acquired deep sparse features, once restored to regular features, are input into the feature fusion block to adaptively integrate local features into the global features.
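For concreteness, a minimal PyTorch sketch of such a 3D convolutional residual block is given below; the 3 \(\times \) 3 \(\times \) 3 kernel size and the 32-channel width are assumptions (the latter taken from the feature dimensions quoted in Sect. 4.3).

```python
import torch.nn as nn

class ResBlock3D(nn.Module):
    """Two 3D convolutions with a ReLU in between and a skip connection around them."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):          # x: (B, C, T, H, W)
        return x + self.body(x)    # skip connection flanking the two convolutions
```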
4.2 3D sparse convolution
Spatial-temporal features that have undergone mask processing exhibit sparsity. Utilizing conventional 3D convolution to process these sparse features would inadvertently encompass numerous non-sparse pixels within the convolutional kernel’s computational domain, leading to unnecessary computational costs. Therefore, we introduce sparse convolution, a technique capable of exclusively processing sparse pixel regions while adaptively disregarding non-sparse pixel areas. The operational mechanism of sparse convolution is illustrated in the accompanying Fig. 2. The left cube (A) represents the sparse features inputted into the sparse convolutional layer, where the green blocks signify sparse pixels, and the orange transparent blocks indicate non-sparse pixel regions. The right cube (C) depicts the output sparse features, with the non-transparent blocks representing pixels processed by sparse convolution. The central white transparent cube (B) symbolizes the sparse convolutional kernel, with the weights denoted as W\(_{(H,W,T)}\) within the blocks, corresponding to the kernel’s weights with the three-dimensional Cartesian coordinates of height (H), width (W), and temporal dimension (T). Taking the output’s yellow block as an example, the yellow dashed-line cube represents the region encompassed by the convolutional kernel, with the computational process involving only the sparse pixels within the kernel, as detailed in the computation equation presented beneath the figure.
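The arithmetic can be sketched with a dense convolution and a voxel mask. This is a conceptual illustration only: a genuine sparse-convolution kernel skips the inactive voxels and thereby saves the computation, whereas this dense stand-in merely reproduces the same values at the active sites.

```python
import torch
import torch.nn.functional as F

def masked_conv3d(feat, mask, weight, bias=None):
    # feat: (B, C_in, T, H, W) features; mask: (B, 1, T, H, W), 1 = sparse (active) voxel.
    active_in = feat * mask                               # inactive voxels contribute nothing
    out = F.conv3d(active_in, weight, bias, padding=1)    # 3x3x3 kernel assumed
    return out * mask                                     # outputs kept only at active sites
```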
4.3 Sparse temporal/spatial feature transformation layer (sparse-TSFT)
The incremental feature fusion layers are utilized to generate local sparse features. These local sparse features, along with the global features, are simultaneously input into the sparse-TSFT layer to enable the adaptive integration of sparse features into the global features.
Each 3D convolutional layer in the sparse-TSFT is paired with a ReLU activation function, forming features of dimensions 32 \(\times \) 7 \(\times \) 64 \(\times \) 64 (\(C \times T \times H \times W\)). Subsequently, these local features, along with the global features, are fed into the sparse-TSFT layer to further accentuate local detail information. Inspired by prior research [21], the sparse-TSFT layer aims to learn a mapping function \(\Delta \) based on the prior condition, providing a modulation parameter pair \((\alpha , \beta )\). In the edge enhancement network, an affine transformation is applied to each intermediate feature mapping, and the learned parameters are capable of rapidly adapting and establishing correlations with global features.
The parameter pair \((\alpha , \beta )\) is obtained from the prior condition \(\Phi \) through the mapping function \(\Delta : \Phi \mapsto (\alpha , \beta )\). Thus, \((\alpha , \beta ) = \Delta (\Phi )\) allows for the retrieval of \((\alpha , \beta )\) from the condition, and the transformation is achieved by shifting the feature mapping of a layer:
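With the symbols defined below, the modulation takes the affine form used in SFT [21]:

$$\begin{aligned} \text {sparse-TSFT}\left( \mathrm {SF} \mid \alpha , \beta \right) = \alpha \odot \mathrm {SF} + \beta . \end{aligned}$$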
Here, SF represents the 3D sparse features generated by the incremental fusion branch, which has the same dimensions as \(\alpha \) and \(\beta \). The symbol \(\odot \) denotes element-wise multiplication.
4.4 Feature reconstruction module
This module primarily serves as a key component for integrating the three-dimensional features extracted by the main backbone network and ultimately generating super-resolved video. Initially, the 3D features produced by the main backbone network are transformed into 2D features, a process that can be represented by the following equation:
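With the dimensions given below, the transformation amounts to folding the temporal axis into the channel axis:

$$\begin{aligned} F_\textrm{2D} = \text {Transform}\left( F_\textrm{3D}\right) : (B, C, T, H, W) \rightarrow (B, C \times T, H, W), \end{aligned}$$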
where B, C, T, H, W represent the batch size, number of channels, temporal dimension, height, and width, respectively. \(\text {Transform}()\) stands for the operation that merges the channel and temporal dimensions. Subsequently, \(F_\textrm{2D}\) is fed into six two-dimensional residual blocks. Each basic two-dimensional residual block is composed of two convolutional layers with a kernel size of 3 \(\times \) 3, interspersed with a ReLU activation layer, and flanked by skip connections at both ends. Through a reduction in the number of channels, \(F_\textrm{2D}\) progressively approximates the detailed features, thereby reconstructing the complete super-resolved video sequence.
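A minimal PyTorch sketch of this folding step is shown below; the exact implementation is an assumption, with the feature dimensions taken from Sect. 4.3.

```python
import torch

def transform_3d_to_2d(f3d: torch.Tensor) -> torch.Tensor:
    # (B, C, T, H, W) -> (B, C * T, H, W): merge the channel and temporal dimensions.
    b, c, t, h, w = f3d.shape
    return f3d.reshape(b, c * t, h, w)

f3d = torch.randn(1, 32, 7, 64, 64)        # C = 32, T = 7, H = W = 64
print(transform_3d_to_2d(f3d).shape)       # torch.Size([1, 224, 64, 64])
```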
5 Experiments
5.1 Datasets
A large-scale dataset is important for training networks. In this experiment, we use the Vimeo-90K [25] dataset for training, with a fixed resolution of 448 \(\times \) 256. To prepare the training data, we first input the original HR frames into LPS-Net to generate the semantic segmentation results, and then apply the spatial frequency computation and the K-means algorithm to the semantic segmentation maps to obtain masked images labeled “Simple,” “Medium,” and “Complex” in the different regions; these are further cropped to a size of 256 \(\times \) 256 before being fed into the SR network. The other input to the SR network is the LR frame, which is downsampled with the “F.interpolate” function and then cropped to a 64 \(\times \) 64 image. In addition, 10,000 frames from the Vimeo-90K dataset were selected for validation during the training process. For testing, we use three datasets: Vid4 [1], REDS [15], and four testing sets of frames from the Vimeo-90K dataset (Vimeo-90K-T). All these frames are pre-processed in the same way as the training data.
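For reference, a minimal sketch of the LR-frame preparation is given below; the 4\(\times \) scale follows from the 256 \(\rightarrow \) 64 crop sizes above, while the bicubic mode is an assumption.

```python
import torch
import torch.nn.functional as F

def make_lr(hr: torch.Tensor, scale: int = 4) -> torch.Tensor:
    # hr: (B, C, H, W) cropped HR frames -> downsampled LR frames
    return F.interpolate(hr, scale_factor=1.0 / scale,
                         mode="bicubic", align_corners=False)

hr_crop = torch.rand(1, 3, 256, 256)   # 256 x 256 HR crop
lr_crop = make_lr(hr_crop)             # -> (1, 3, 64, 64) LR network input
```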
5.2 Parameter settings
We implement the proposed network using the PyTorch platform. In the training process, we use the Adam optimizer with a batch size of 80 and the mean square error (MSE) loss function. The initial learning rate is 4e−4 and is halved every 25 epochs. For all the results reported in the paper, training was conducted for 80 epochs on the Nvidia A100 platform.
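The schedule can be sketched as follows; the one-layer model and the random tensors are placeholders standing in for the proposed network and the Vimeo-90K loader.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, kernel_size=3, padding=1)             # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.5)
criterion = nn.MSELoss()                                      # MSE loss

for epoch in range(80):                                       # 80 epochs
    lr_in, hr_ref = torch.rand(80, 3, 64, 64), torch.rand(80, 3, 64, 64)  # batch size 80
    optimizer.zero_grad()
    loss = criterion(model(lr_in), hr_ref)
    loss.backward()
    optimizer.step()
    scheduler.step()                              # halves the learning rate every 25 epochs
```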
5.3 Comparisons with state-of-the-art methods
To verify the efficiency and high performance of the proposed algorithm, in this section we compare the proposed method with five previous VSR algorithms: D3Dnet [27], STAN [22], RSTT [6], RealBasicVSR [3], and STDAN [19]. We also present the results of bicubic interpolation as the baseline.
5.3.1 Objective results
PSNR/SSIM are used as metrics to demonstrate the performance of the methods. Evaluation results of our original framework and the compared algorithms on the Vid4, REDS, and Vimeo-90K-T datasets are shown in Table 1. All results are calculated in the Y channel. Moreover, the computational efficiency (the number of parameters and FLOPs) is also presented in Table 1. As it is not easy to calculate the exact computational cost of 3D sparse convolution, we used images from the REDS dataset and recorded the proportion of pixels entering the medium and complex branches relative to the total pixel count. This ratio serves to approximate the computational load of the 3D sparse convolution on the respective branch.
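A sketch of this approximation follows; the dense per-branch figure here is an illustrative placeholder rather than a measurement from the paper.

```python
import numpy as np

def sparse_branch_flops(dense_flops: float, mask: np.ndarray) -> float:
    # Scale the branch's dense FLOPs by the fraction of pixels its mask marks as active.
    return dense_flops * float(mask.sum()) / mask.size

toy_mask = np.zeros((7, 180, 320))
toy_mask[:, :90, :] = 1                          # 50% of the pixels are active
print(sparse_branch_flops(100e9, toy_mask))      # 5e+10, i.e. ~50 GFLOPs
```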
Table 1 presents a compelling argument for the proposed algorithm’s capability to deliver high-quality VSR with exceptional computational efficiency. With only 0.78M parameters, the proposed algorithm exhibits a superior balance between model complexity and performance metrics, achieving a PSNR of 29.23 and SSIM of 0.836 on average. This is especially notable when juxtaposed with other methods like D3Dnet and STAN, which, despite their larger parameter counts of 2.58M and 16.16M, respectively, do not proportionately outperform the proposed method in terms of average PSNR and SSIM. When examining the FLOPs, a similar pattern emerges. The proposed method operates at a mere 164.63G FLOPs, of which the simple branch accounts for approximately 134.28 GFLOPs, the medium branch for around 27.43 GFLOPs, and the complex branch for the remaining 2.92 GFLOPs. This is a fraction of the computational cost incurred by methods such as RSTT and RealBasicVSR, which demand 1096.91G and 2601.67G FLOPs, respectively. Despite this, these methods do not achieve a commensurate increase in performance, with RSTT achieving an average PSNR of 28.20 and SSIM of 0.814, and RealBasicVSR a PSNR of 26.23 and SSIM of 0.776.
On individual datasets, the proposed algorithm’s efficiency becomes even more pronounced. For instance, on the Vimeo-90K-T dataset, it reaches a PSNR of 32.09, which is significantly higher than that of the heavyweight competitor RealBasicVSR, which achieves 28.31 despite its greater computational demand. In the REDS dataset, the proposed method’s PSNR of 29.26 is strikingly close to the highest recorded PSNR of 29.29 by D3Dnet, yet with only about 30% of D3Dnet’s parameter count and 40% of its FLOPs.
The proposed method also demonstrates consistency across the datasets, maintaining a narrow performance range, which suggests robustness and generalizability. This is in contrast to some methods that exhibit a wider fluctuation in performance across different datasets, such as STAN and RSTT, which may indicate a potential for a lack of adaptability to varying data characteristics. Consequently, the proposed algorithm distinguishes itself not only by its lean parameterization and reduced computational cost but also by its ability to maintain high fidelity in VSR tasks, which is particularly advantageous for real-time applications or devices with limited processing capabilities.
5.3.2 Subjective results
The SR results of the different methods are compared in Figs. 4, 5, and 6. Looking at the “MAREE” text in Fig. 4, the proposed technique produces a clearer and more complete rendering than other methods such as STAN and RealBasicVSR. Moreover, the green light in Fig. 6 shows less edge blurring with our method than with the other methods. Although the subjective quality of the SR results generated by our proposed method may not be strikingly distinguished from those produced by alternative methods, it is imperative to consider the computational efficiency that it brings to the table. Our approach is characterized by its significantly reduced computational and parameter burden, a salient feature that stands out in the realm of super-resolution technology. Unlike other algorithms that demand considerable computational resources, our proposed method achieves comparable super-resolution outcomes with a fraction of the computational load. This advantage makes it particularly well suited for applications where processing power is at a premium, or real-time processing is required, thereby offering a pragmatic balance between performance and resource expenditure.
6 Ablation studies
To investigate the effect of the proposed modules in our algorithm, we conduct comprehensive ablation studies in this section.
6.1 Efficiency of 3D sparse
To increase the efficiency of the proposed method, 3D sparse convolution is applied to decrease the FLOPs. We therefore compare the efficacy of our proposed method against a baseline model that operates without 3D sparse convolution (w/o Sparse-Conv). The comparison is shown in Table 2. It demonstrates that the proposed method offers a substantial reduction in computational complexity, requiring only about 60% of the baseline's FLOPs. At the same time, we observed that both methods deliver nearly identical results. These marginal variances are well within the range of experimental noise, implying that there is no significant loss in image quality despite the lowered computational overhead. This reduction indicates a notable enhancement in computational efficiency, which is particularly advantageous for environments where processing resources are constrained or where rapid processing is essential.
6.2 Efficiency of incremental structure
The incremental structure consists of three branches, each of which processes regions of a different complexity, so the more complex the region, the more convolutional layers it passes through. To test the effectiveness of the incremental structure, we removed the masks and input only the LR frames to the network, so there is no partitioning; in other words, all features pass through all the convolutions in the network. The comparison between the baseline model (All-Branches) and the proposed method is shown in Table 3, paying specific attention to the balance between computational cost and image reconstruction quality. Despite the “Proposed” method yielding marginally lower performance metrics compared to the “All-Branches” model, the difference in image quality is minimal. Crucially, this modest decrement is offset by a significant reduction in computational load, as the proposed method requires only 164.63G FLOPs, a reduction of 38.5% from the “All-Branches” model’s cost. This substantial decrease in computational requirements translates to higher cost-effectiveness for the proposed method. By maintaining a high standard of image quality with far fewer computational resources, the proposed method demonstrates an enhanced cost-to-performance ratio. It leverages the reduced computational expense to deliver a nearly equivalent level of image fidelity, thus underscoring the method’s suitability for resource-constrained environments without substantive sacrifices in output quality.
6.3 Feature aggregation
To validate the effect of the proposed sparse-TSFT module in the adaptive fusion of features, we establish a baseline model (w/o sparse-TSFT). It only adopts concatenation to directly add local and global features at each step, without emphasizing the local detail features. The quantitative results on the Vid4 dataset are shown in Table 4: the PSNR and SSIM values of the baseline model decrease significantly while saving only a few FLOPs and keeping the same parameter count. We also present the detailed feature maps from the “simple,” “medium,” and “complex” branches in Fig. 7, which come from sequence “000” in the REDS dataset; the corresponding mask and the HR frames are shown in Fig. 8 for a clearer comparison. These figures confirm that the sparse-TSFT module can acquire more helpful and detailed content in feature aggregation.
7 Discussion and future work
Our current investigation has revealed potential avenues for improvement within our network’s architecture. Notably, the 3D Resblock and 3D Sparse Resblock modules, while adept at capturing spatial information, may not be fully optimizing the temporal correlations present in video sequences. This underutilization could be a contributing factor to the suboptimal performance when compared to state-of-the-art methods, particularly in scenarios involving complex motion patterns and temporal details. Future iterations of our model will explore the integration of recurrent neural networks (RNNs) or attention mechanisms that span the time dimension to enhance the temporal feature extraction capabilities of these blocks. Moreover, the 3D Sparse Resblock’s capacity to encapsulate sparse features could be further exploited. We are considering refining the sparsity constraints or adopting alternative sparse representation techniques to allow for a richer feature set that can be more discriminative and informative for super-resolution tasks.
In addition, our chosen spatial upscaling technique, the pixel shuffle operation, might not be intricately reconstructing the high-frequency details that are crucial for the clarity and richness of super-resolved images. This could be a contributing reason for our method’s limitations in achieving the desired image quality, especially when compared to the intricate textures present in the ground truth high-resolution counterparts. To surmount this challenge, our future work will delve into alternative upscaling methods. For example, exploring the potential of transpose convolution techniques and testing the ability of generative adversarial networks (GANs) to synthesize fine details, to ascertain their effectiveness in bridging the gap between the current outcomes and the high-fidelity super-resolution benchmarks.
Collectively, these enhancements are expected to mitigate the identified deficiencies. This progression will involve extensive empirical evaluations to ensure that any modifications align with our objective of advancing the fidelity of super-resolution video reconstruction. By continuously refining our approach, we aspire to not only address the limitations observed but also to contribute novel insights into the field of video super-resolution.
8 Conclusion
In this paper, we introduce an innovative semantic guidance incremental 3D sparse convolutional network that successfully addresses the challenge of computational complexity in VSR while maintaining an excellent level of video quality. Specifically, the method effectively reduces the computational cost of the network and solves the resource consumption problem often faced by VSR models in practical applications. This innovation provides a feasible solution for realizing high-quality video super-resolution with limited computational resources and provides strong support for the wide application of VSR technology. Traditional VSR methods indiscriminately process entire video frames, resulting in inefficient computational allocation to less pertinent regions and suboptimal enhancement of detail-rich segments.
Our proposed framework employs semantic segmentation algorithms to pre-process low-resolution frames, effectively partitioning the feature maps according to semantic density before their introduction into the VSR neural network. This method facilitates a discriminating processing strategy, where only informationally dense segments undergo the sophisticated super-resolution procedure. Consequently, this tailored processing significantly conserves computational resources and increases the VSR efficiency. The adoption of 3D sparse convolution extracts the segmented sparse feature maps, allocating computational efforts prudently. Notably, this utilization of 3D sparse convolution on natural datasets is pioneering, extending its prior application from point cloud data to VSR, thereby presenting a novel paradigm in the field. Finally, we design a sparse-TSFT module for feature fusion. This module effectively reduces problems such as edge blur and artifacts that may be caused by regional fusion and improves the quality of SR results.
References
Caballero, J., Ledig, C., Aitken, A., Acosta, A., Totz, J., Wang, Z., Shi, W.: Real-time video super-resolution with spatio-temporal networks and motion compensation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4778–4787 (2017)
Chan, K.C., Zhou, S., Xu, X., Loy, C.C.: Basicvsr++: Improving video super-resolution with enhanced propagation and alignment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5972–5981 (2022)
Chan, K.C., Zhou, S., Xu, X., Loy, C.C.: Investigating tradeoffs in real-world video super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5962–5971 (2022)
Dawood, M.S., Benazer, S.S., Karthick, R., Ganesh, R.S., Mary, S.S.: Performance analysis of efficient video transmission using EvalSVC, EvalVid-NT, EvalVid. Mater. Today Proc. 46, 3848–3850 (2021)
Gatys, L.A., Ecker, A.S., Bethge, M., Hertzmann, A., Shechtman, E.: Controlling perceptual factors in neural style transfer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3985–3993 (2017)
Geng, Z., Liang, L., Ding, T., Zharkov, I.: Rstt: Real-time spatial temporal transformer for space-time video super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17,441–17,451 (2022)
Kong, X., Zhao, H., Qiao, Y., Dong, C.: Classsr: a general framework to accelerate super-resolution networks by data characteristic. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12,016–12,025 (2021)
Ledig, C., Shi, W., Bai, W., Rueckert, D.: Patch-based evaluation of image segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3065–3072 (2014)
Lee, R., Venieris, S.I., Lane, N.D.: Deep neural network-based enhancement for image and video streaming systems: a survey and future directions. ACM Comput. Surv. 54(8), 1–30 (2021)
Li, G., Ji, J., Qin, M., Niu, W., Ren, B., Afghah, F., Guo, L., Ma, X.: Towards high-quality and efficient video super-resolution via spatial-temporal data overfitting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10,259–10,269 (2023)
Liang, J., Fan, Y., Xiang, X., Ranjan, R., Ilg, E., Green, S., Cao, J., Zhang, K., Timofte, R., Gool, L.V.: Recurrent video restoration transformer with guided deformable attention. Adv. Neural. Inf. Process. Syst. 35, 378–393 (2022)
Liu, D., Wang, Z., Fan, Y., Liu, X., Wang, Z., Chang, S., Huang, T.: Robust video super-resolution with learned temporal dynamics. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2507–2515 (2017)
Liu, J., Lu, M., Chen, K., Li, X., Wang, S., Wang, Z., Wu, E., Chen, Y., Zhang, C., Wu, M.: Overfitting the data: compact neural video delivery via content-aware feature modulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4631–4640 (2021)
Lu, Z., Xiao, Z., Bai, J., Xiong, Z., Wang, X.: Can SAM boost video super-resolution? arXiv preprint arXiv:2305.06524 (2023)
Nah, S., Baik, S., Hong, S., Moon, G., Son, S., Timofte, R., Mu Lee, K.: Ntire 2019 challenge on video deblurring and super-resolution: dataset and study. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0 (2019)
Ren, W., Pan, J., Cao, X., Yang, M.H.: Video deblurring via semantic segmentation and pixel-wise non-linear kernel. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1077–1085 (2017)
Tao, X., Gao, H., Liao, R., Wang, J., Jia, J.: Detail-revealing deep video super-resolution. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4472–4480 (2017)
Tian, Y., Zhang, Y., Fu, Y., Xu, C.: Tdan: Temporally-deformable alignment network for video super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3360–3369 (2020)
Wang, H., Xiang, X., Tian, Y., Yang, W., Liao, Q.: STDAN: deformable attention network for space-time video super-resolution. IEEE Trans. Neural Netw. Learn. Syst. (2023)
Wang, X., Chan, K.C., Yu, K., Dong, C., Change Loy, C.: Edvr: Video restoration with enhanced deformable convolutional networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2019)
Wang, X., Yu, K., Dong, C., Loy, C.C.: Recovering realistic texture in image super-resolution by deep spatial feature transform. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 606–615 (2018)
Wen, W., Ren, W., Shi, Y., Nie, Y., Zhang, J., Cao, X.: Video super-resolution via a spatio-temporal alignment network. IEEE Trans. Image Process. 31, 1761–1773 (2022)
Xiao, Z., Xiong, Z., Fu, X., Liu, D., Zha, Z.J.: Space-time video super-resolution using temporal profiles. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 664–672 (2020)
Xu, K., Yu, Z., Wang, X., Mi, M.B., Yao, A.: An implicit alignment for video super-resolution. arXiv preprint arXiv:2305.00163 (2023)
Xue, T., Chen, B., Wu, J., Wei, D., Freeman, W.T.: Video enhancement with task-oriented flow. Int. J. Comput. Vis. 127, 1106–1125 (2019)
Yi, P., Wang, Z., Jiang, K., Jiang, J., Ma, J.: Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3106–3115 (2019)
Ying, X., Wang, L., Wang, Y., Sheng, W., An, W., Guo, Y.: Deformable 3d convolution for video super-resolution. IEEE Signal Process. Lett. 27, 1500–1504 (2020)
Zhang, A., Ren, W., Liu, Y., Cao, X.: Lightweight image super-resolution with superpixel token interaction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12,728–12,737 (2023)
Zhang, H., Liu, D., Xiong, Z.: Two-stream action recognition-oriented video super-resolution. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8799–8808 (2019)
Zhang, Y., Yao, T., Qiu, Z., Mei, T.: Lightweight and progressively-scalable networks for semantic segmentation. Int. J. Comput. Vis. pp. 1–19 (2023)
Zhu, S., Urtasun, R., Fidler, S., Lin, D., Change Loy, C.: Be your own Prada: fashion synthesis with structural coherence. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1680–1688 (2017)
Funding
Open Access funding provided by the IReL Consortium.