Introduction

The task of image super-resolution (SR) is to reconstruct clear high-resolution images from low-resolution ones. SR is often regarded as the inverse problem of image degradation, the forward process that mathematically models how image quality deteriorates. Following previous works1,2,3,4,5, the degradation pipeline is typically modeled as Eq. (1).

$$\begin{aligned} y = (x *k_{h})_{\downarrow _{s}}+n, \end{aligned}$$
(1)

where x represents the high-resolution (HR) image and y corresponds to the low-resolution (LR) image. The operator \(*\) denotes the two-dimensional convolution operation, \(k_{h}\) is the Gaussian kernel, \(\downarrow _{s}\) denotes the downsampling operation with a scale factor of s, and n refers to additive white Gaussian noise (AWGN). Classical SR methods6,7,8 assume that the degradation pipeline is a single bicubic downsampling. However, if the predefined degradation does not exactly match the practical situation, the reconstructed HR image may exhibit unpleasant artifacts1. Therefore, recovering sharp edges and rich details from LR images with unknown degradation1,2,5,9,10,11,12 is an extremely meaningful and challenging task.
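To make the degradation pipeline concrete, the following is a minimal PyTorch sketch of Eq. (1). The function name and the direct-downsampling choice are our illustrative assumptions, since Eq. (1) does not fix a particular downsampler; an odd kernel size is assumed.

```python
import torch
import torch.nn.functional as F

def degrade(x, k, s, sigma_n):
    """Synthesize y = (x * k_h)downarrow_s + n as in Eq. (1).

    x:       HR image, shape (1, C, H, W)
    k:       Gaussian blur kernel, shape (kh, kw), kh and kw odd
    s:       integer downsampling scale factor
    sigma_n: standard deviation of the AWGN term n
    """
    c = x.shape[1]
    # apply the same kernel to every channel via a depthwise convolution
    weight = k.expand(c, 1, *k.shape)
    y = F.conv2d(x, weight, padding=k.shape[-1] // 2, groups=c)
    y = y[..., ::s, ::s]                       # s-fold direct downsampling
    return y + sigma_n * torch.randn_like(y)   # additive white Gaussian noise
```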

The most common blind SR schemes are typically divided into two stages: the first stage models the kernel explicitly or implicitly by optimizing a deep neural network on the degraded image1,2,3,4,5,9, and the second stage feeds the LR image, together with the additional degradation prior, into the SR network to obtain the reconstructed HR image. In the first stage, a mismatch between the estimated blur kernel and the actual one can lead to over-smoothed or over-sharpened results1,2,3. A viable solution is to estimate the kernel accurately1,9 and integrate it robustly with the SR backbone2,3,5.

Recent research1,2,3,4,5,9 has mainly concentrated on the first stage of kernel modeling. DCLS3 proposes a robust dynamic kernel estimation network and introduces a module for degradation representation embedding. However, its SR network has limited ability to represent spatial features, making it difficult to recover structural information well. Figure 1 shows the reconstruction results of state-of-the-art methods and of our method on structural textures. It can be clearly observed that current methods lack structural prior knowledge, leaving ambiguous details and edges in the recovered SR images.

Figure 1

Blind super-resolution of Img100 from DIV2KRK9, for scale factor 4. Based on the fusion of local and global features, our method is effective in restoring sharp and clean edges, and outperforms previous state-of-the-art approaches such as ZSSR13, IKC1, AdaTarget14, DANv22, and DCLS3.

It is broadly recognized that non-local operations15,16, which introduce self-similarity priors, are significant for recovering recurring textures within the same image. Moreover, spatial attention and channel attention mechanisms can effectively capture local features. Motivated by these observations, we propose a network that combines kernel estimation with structural prior knowledge and can leverage both local spatial and global features to boost reconstruction performance for images with high self-similarity. To be specific, we employ the deep constrained least squares3 (DCLS) block to deblur the original feature \(f_o\) and obtain a clean feature \(f_c\). Next, we split the original feature \(f_o\) into two vectors along the channel dimension: \(\widehat{f_o}\) and \(\overline{f_o}\). These three vectors, \(f_c\), \(\overline{f_o}\), and \(\widehat{f_o}\), are fed together into a series of triple path attention blocks (TPAB) to perform deep feature extraction and exploit local spatial information to compensate for the gap caused by kernel estimation. Furthermore, the global texture fusion block (GTFB) adaptively adjusts the self-similarity scores of non-local features to achieve the embedding of global structural priors. We have performed several standard experiments on benchmarks with various degradation settings to evaluate our proposed method. The quantitative and qualitative results demonstrate that our network delivers excellent performance on all datasets, particularly for images with rich structural information. The main contributions of this paper are summarized as follows:

  • We propose a blind SR network capable of combining kernel estimation with structural prior knowledge to reconstruct textures with high self-similarity.

  • We employ a channel split strategy to take advantage of the original local spatial and channel features in order to compensate for artifacts generated by the kernel estimation and the deblurring operation.

  • We design a global texture fusion block that aggregates local spatial features with non-local operations to enhance recovery performance in images with high self-similarity.

  • Extensive experiments with various degradation settings demonstrate that our method achieves outstanding performance in the task of blind SR.

Related work

SR of bicubic and multiple degradation

The pioneering work of SRCNN6 successfully motivated interest among researchers in the field of SR. Inspired by hierarchical architectures7,8,17 and robust loss functions11,12,18,19,20,21, CNN-based methods have achieved outstanding performance on the predefined bicubic downsampling SR task, while the degradation process in the real world is generally unknown and complicated11,12. In practical applications, if the bicubic kernel assumed by classical methods does not match the actual degradation kernel, the reconstructed SR image will contain unpleasant artifacts that severely affect visual perception quality. This discrepancy between the assumed kernel and the actual kernel gives rise to a domain gap22,23,24, which is a challenge in practical applications of SR.

Another line of non-blind SR methods4,25,26,27,28 is designed to super-resolve multiple types of degraded images with corresponding kernels. These methods make classical SR networks more robust and applicable to a wider range of real-world scenarios. FFDNet25 utilizes a noise level map as additional input, allowing it to handle images affected by various types of noise. Similarly, SRMD4 proposes a kernel stretching strategy that incorporates the two degradation parameters, the blur kernel k and the noise level n, together with the LR image as input to the SR network. Zhang et al.29 combine learning-based methods with model-based methods to design an end-to-end unfolding network that can handle various types of degraded images at different scales. UDVD27 introduces dynamic convolution into the kernel estimation network, where the filter parameters are dynamically adjusted according to the input degraded image. KMSR26 utilizes generative adversarial networks to learn the distribution of kernels in real degraded images. Inspired by KMSR26, Son et al.28 propose an adaptive downsampling model that employs an unsupervised approach to simulate the actual degradation process of real-world images. They then synthesize paired data and develop an SR network capable of handling various types of degradation.

SR of unknown kernel

The most common approach for the blind SR task is based on kernel estimation methods1,2,3,4,5,9,30. KernelGAN9 utilizes cross-scale image similarity to accomplish kernel estimation on specific images and combines it with a classical method13 to achieve blind reconstruction. MANet30 further investigates spatially variant blur kernels in order to super-resolve object motion and out-of-focus blur in real-world scenarios. Gu et al.1 use an iterative correction method to alleviate the effects caused by the mismatch between the estimated and actual kernels. Luo et al.2,5 adopt an end-to-end network to alternately optimize the estimator and the restorer. These two methods1,2 are effective but time-consuming owing to their elaborate optimization steps. DCLS3 reformulates a practical degradation model and proposes a deep constrained least squares module that performs deconvolution to achieve robust degradation awareness. The aforementioned methods1,2,3,5,9,22,23 concentrate on modeling degradation either implicitly22,23,31 or explicitly1,2,3,4,5,9,10,32 without delving into the role of structural textures as prior knowledge. This may be a potential factor limiting the upper bound of blind SR performance.

Figure 2

The overall architecture of our network and the structure of related blocks. Given an LR image, we first estimate the kernel k and feed it into the DCLS module to achieve degradation representation embedding. The triple path attention groups take the clean feature \(f_c\) and the chunked original features \(\overline{f_o}\) and \(\widehat{f_o}\) as input to restore the clean SR image.

Method

Architecture

In this subsection, we introduce the overall architecture of our model. As shown in Fig. 2, our method mainly contains two stages: degradation representation embedding and texture details recovery. The first stage includes the dynamic kernel estimation and deblurring operation based on the DCLS3 module. The estimator \(N_e\) accomplishes robust kernel estimation from the degraded LR image. Next, the LR image and the estimated blur kernel k are jointly input into the DCLS module for deblurring. Lastly, the clean and original shallow features are fed into the triple path attention network, which consists of triple path attention blocks (TPAB) and global texture fusion blocks (GTFB), to achieve local and global feature fusion. Details of the pipeline and the relevant blocks are described in the following subsections.

Degradation representation embedding

Figure 3

The overall architecture of dynamic kernel estimation. Given an LR image as input, the network first generates four specific filters. These filters are then convolved sequentially with an identity kernel \(I_k\) to produce a single kernel k whose receptive field matches the size of the predicted kernel.

Inspired by the work of DCLS3, our method employs dynamic kernel estimation, as shown in Fig. 3. Given an LR image with unknown degradation as input, three residual blocks are applied to extract deep features \(f_s\), followed by global average pooling to obtain the flattened features \(\overline{f_s}\). A fully connected layer maps the specific degradation information to four filters, \(\widehat{h_0}\), \(\widehat{h_1}\), \(\widehat{h_2}\), and \(\widehat{h_3}\), with kernel sizes of \(11\times 11\), \(7\times 7\), \(5\times 5\), and \(1\times 1\), respectively, so that the receptive field is consistent with the size of the predicted kernel k. The process of dynamic estimation is shown in Eq. (2).

$$\begin{aligned} k = I_k*\widehat{h_0}*\widehat{h_1}*\widehat{h_2}*\widehat{h_3}, \end{aligned}$$
(2)

where \(I_k\) is the identity kernel, \(\widehat{h_0}\), \(\widehat{h_1}\), \(\widehat{h_2}\), and \(\widehat{h_3}\) are the specific filters mapped from degradation information, and k is the kernel estimated by the estimator \(N_e\). \(I_k\) is sequentially convolved with these filters, enabling the parameters in the network \(N_e\) to vary with different degraded inputs. Meanwhile, the DCLS3 module utilizes deconvolution operations to obtain the clean feature as in Eq. (3).

$$\begin{aligned} f_c = DCLS_{deconvolve}(f_{o},k), \end{aligned}$$
(3)

where \(f_{o}\) represents the blurry original features extracted from the LR image by a \(3\times 3\) convolution layer and three residual blocks, k is the kernel predicted by the network \(N_e\), and \(f_{c}\) represents the clean features deblurred by the deconvolution operation of the DCLS3 module.
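As a concrete illustration of Eq. (2), the sketch below composes the four predicted filters with an identity kernel using full convolutions. The function name is ours; note also that PyTorch's conv2d is cross-correlation, which is equivalent here up to a kernel flip.

```python
import torch
import torch.nn.functional as F

def compose_kernel(filters):
    """Eq. (2): k = I_k * h0 * h1 * h2 * h3.

    filters: predicted filters of shape (1, 1, s, s) with
             s = 11, 7, 5, 1, as produced by the estimator N_e.
    """
    k = torch.zeros(1, 1, 1, 1)
    k[..., 0, 0] = 1.0                 # identity (delta) kernel I_k
    for h in filters:
        s = h.shape[-1]
        # padding = s - 1 turns conv2d into a 'full' convolution,
        # so the kernel grows: 1 -> 11 -> 17 -> 21 -> 21
        k = F.conv2d(k, h, padding=s - 1)
    return k                           # final 21x21 estimated kernel
```

The receptive field of the composed kernel (\(21\times 21\)) matches the kernel size fixed during training and testing in the isotropic setting described later.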

Texture details recovery

Even after introducing the deconvolution operation through the DCLS3 module, the damaged high-frequency information cannot be fully restored. Therefore, we propose a novel network that not only strongly extracts local features to compensate for the decline of high-frequency components but also incorporates non-local15,16 operations to fuse local and global features.

Figure 2 illustrates the proposed SR network, which mainly consists of the extraction of original features and the fusion of local features with global features. A \(3\times 3\) convolutional layer and three residual blocks without batch normalization33 are used to extract the original features \(f_o\) as in Eq. (4).

$$\begin{aligned} f_o = h_{Resblock}(h_{conv}(I_{LR})), \end{aligned}$$
(4)

where \(I_{LR}\in {R^{H\times W\times C}}\) is the input LR image, H and W represent the height and width of the patch cropped from a sub-image, and C is the number of RGB channels.

In the previous stage we obtained the clean feature \(f_c\). FAIG34 demonstrates that a one-branch network without a degradation prior can achieve performance comparable to a two-branch method with degradation information. Although it may seem reasonable to directly use the clean feature \(f_c\) as input to the SR network for recovery, the offset of kernel estimation9,30 and the insufficiency of the deblurring function in the DCLS3 module would prevent the SR backbone from effectively restoring highly structured textures. Therefore, we propose a Triple Path Attention Group (TPAG) to extract the deep feature f as in Eqs. (5) and (6).

$$\begin{aligned} \psi (f_c,\overline{f_o},\widehat{f_o})&= h_{GTFB}(h^n_{TPAB}(f_c,\overline{f_o},\widehat{f_o})), \end{aligned}$$
(5)
$$\begin{aligned} f&=\psi _N(\psi _{N-1}(\cdots \psi _2(\psi _1(f_c,\overline{f_o},\widehat{f_o})))), \end{aligned}$$
(6)

where \(\psi (f_c,\overline{f_o},\widehat{f_o})\) represents a TPAG that adopts the clean feature \(f_c\) and the chunked original features \(\overline{f_o}\) and \(\widehat{f_o}\) as additional inputs, and \(h_{GTFB}(h^n_{TPAB})\) means that each group is composed of n Triple Path Attention Blocks (TPAB) followed by one Global Texture Fusion Block (GTFB). f is the deep clean feature, and N is the number of TPAGs in our SR network.
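The following PyTorch sketch mirrors Eqs. (5) and (6) with stand-in blocks: TPAB and GTFB are stubbed as plain convolutions here (their actual structure is described in the next subsections), and re-injecting the chunked original features at every group is our reading of Fig. 2.

```python
import torch
import torch.nn as nn

class TPAG(nn.Module):
    """Eq. (5): n TPABs followed by one GTFB (both stubbed here)."""
    def __init__(self, c, n_tpab=11):
        super().__init__()
        # stand-in TPAB: fuses the running feature (c channels) with the
        # two chunked original features (c // 2 channels each)
        self.tpabs = nn.ModuleList(
            [nn.Conv2d(2 * c, c, 3, padding=1) for _ in range(n_tpab)]
        )
        self.gtfb = nn.Conv2d(c, c, 3, padding=1)  # stand-in GTFB

    def forward(self, f, f_bar, f_hat):
        for tpab in self.tpabs:
            f = f + tpab(torch.cat([f, f_bar, f_hat], dim=1))
        return self.gtfb(f)

# Eq. (6): N cascaded groups starting from the clean feature f_c
def deep_feature(groups, f_c, f_bar, f_hat):
    f = f_c
    for g in groups:
        f = g(f, f_bar, f_hat)
    return f
```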

In addition, we further refine the deep feature f through a \(3\times 3\) convolutional layer, with the original low-frequency feature \(f_{o}\) added through a long skip connection7,8,35,36, as in Eq. (7).

$$\begin{aligned} I_{SR} = h_{upsample}(h_{conv}(f)+f_{o}). \end{aligned}$$
(7)

Finally, pixel shuffle37 serves as the upsampling module and completes the mapping from feature maps to the HR image \(I_{SR}\).
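Putting Eq. (7) and the pixel-shuffle upsampler together gives the minimal sketch below. The single \(\times\)4 shuffle and the channel widths are our assumptions; implementations often stack two \(\times\)2 stages instead.

```python
import torch.nn as nn

class ReconstructionHead(nn.Module):
    """Eq. (7): I_SR = h_upsample(h_conv(f) + f_o), with pixel shuffle37."""
    def __init__(self, c=64, scale=4, out_ch=3):
        super().__init__()
        self.refine = nn.Conv2d(c, c, 3, padding=1)       # h_conv
        self.upsample = nn.Sequential(                    # h_upsample
            nn.Conv2d(c, c * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(c, out_ch, 3, padding=1),
        )

    def forward(self, f, f_o):
        return self.upsample(self.refine(f) + f_o)        # long skip from f_o
```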

Triple path attention block

Deep SR networks contain specific filters that can handle various types and levels of degraded images34. These specific filters, which can be used to address corresponding degradation such as noise and blur, are located at different positions and branches within a single SR network. Channel attention8,36,38,39 and spatial attention40,41 mechanisms can enhance the local modeling ability. Therefore, we introduce these mechanisms as two branches in TPAB, allowing the network to strengthen its generalization and better handle different types of degradation.

The triple path attention block, consisting of residual channel attention and residual local spatial branches, is shown in Fig. 2. The original shallow features \(f_{o}\) are split into two feature maps, \(\overline{f_o}\) and \(\widehat{f_o}\), along the channel dimension. They are combined with the deblurred clean features \({f_c}\) and passed through TPABs to refine local texture features and compensate for the loss of high-frequency texture details. Specifically, \(\overline{f_o}\) and \(\widehat{f_o}\) are processed by residual channel attention branches8 and residual local spatial branches41, respectively, to extract deep local features. Meanwhile, \(\overline{f_o}\) and \(\widehat{f_o}\) are concatenated with \(f_o\) and fused by a convolutional layer. Lastly, the aggregated local features pass through a GTFB to establish connections between local and non-local features.
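A sketch of the channel split and the two attention branches described above follows. The exact layer widths and the fusion layout are our assumptions based on the cited channel8 and spatial41 attention designs.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention8 (sketch)."""
    def __init__(self, c, r=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, max(c // r, 1), 1), nn.ReLU(inplace=True),
            nn.Conv2d(max(c // r, 1), c, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)

class SpatialAttention(nn.Module):
    """Spatial attention over pooled channel statistics41 (sketch)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        stats = torch.cat([x.mean(1, keepdim=True),
                           x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(stats))

class TPABSketch(nn.Module):
    """The three paths: attention branches on the two chunks of f_o,
    concatenated with f_o and fused by a conv, then added to f_c."""
    def __init__(self, c):
        super().__init__()
        half = c // 2
        self.ca = nn.Sequential(nn.Conv2d(half, half, 3, padding=1),
                                ChannelAttention(half))
        self.sa = nn.Sequential(nn.Conv2d(half, half, 3, padding=1),
                                SpatialAttention())
        self.fuse = nn.Conv2d(2 * c, c, 1)

    def forward(self, f_c, f_o):
        f_bar, f_hat = f_o.chunk(2, dim=1)   # split along the channel dim
        local = torch.cat([self.ca(f_bar), self.sa(f_hat), f_o], dim=1)
        return f_c + self.fuse(local)        # residual fusion with f_c
```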

Global texture fusion block

Non-local15,16,42 operations are capable of capturing long-range dependencies between different parts of an image, addressing the limitation of the receptive field by introducing self-attention mechanisms that enable each position to attend to all other positions in the input. This operation is particularly instrumental in restoring structural textures that exhibit strong self-similarity. Previous researchers15,42 hypothesized that non-local textures with higher similarity scores would be more advantageous for restoring edge information. However, they overlooked the fact that when an image suffers from severe degradation, non-local textures with low similarity scores may actually be more useful for restoring edges16.

Fusing the local spatial texture features without careful consideration does not significantly improve the network's ability to restore textures. Therefore, we cascade a global texture fusion block (GTFB) at the end of each TPAG. In this module, we adopt the global learnable attention block16 after the local feature fusion. The global learnable attention block adaptively adjusts the similarity scores of non-local textures, allowing the network to effectively utilize non-local textures that previously had low similarity scores but can provide rich details.

As shown in Fig. 4, we take the feature map \(X\in R^{H\times W \times C}\) as input and convert X into three vectors Q, L, and \(V\in R^{C\times HW }\) to implement the global attention mechanism. Super-Bit Locality-Sensitive Hashing (SB-LSH) divides the feature map into buckets to reduce computational cost, as shown in Eq. (8).

$$\begin{aligned} \lambda _i = \left\{ x_j|argmax(MX_i)=argmax(MX_j) \right\} , \end{aligned}$$
(8)

where \(M \in R^{b\times C}\) is a randomly initialized orthogonal matrix and b is the number of hash buckets, \(X_i\in R^C\) is the \(i\)-th component of Q, and \(\lambda _i\) is the index set corresponding to \(Q_i\). Next, we use the learnable similarity score \(S_l\) (LSS) and the fixed dot-product similarity score \(S_f\) (DPSS) to measure self-similarity as in Eq. (9).

$$\begin{aligned} S(X_i) = S_f(X_i)+S_l(X_i), \end{aligned}$$
(9)

where \(S_f(X_i)=X^T_i X_i\), \(S_l(X_i)\) is defined as Eq. (10).

$$\begin{aligned} S_l(X_i) = (W_2\sigma (W_1L[\lambda _i]+b_1)+b_2) , \end{aligned}$$
(10)

where \(\sigma\) is the ReLU activation and \(W_1, W_2, b_1, b_2\) are learnable parameters.
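A sketch of the bucket assignment (Eq. 8) and the learnable score (Eq. 10) follows. The MLP hidden width and the orthogonalization via QR are our assumptions; the function and class names are ours.

```python
import torch
import torch.nn as nn

def sb_lsh_buckets(X, b):
    """Eq. (8): Super-Bit LSH bucket assignment (sketch).

    X: (N, C) flattened spatial features; positions whose projections onto
    the random orthogonal matrix M share the same argmax fall in one bucket.
    Requires b <= C.
    """
    C = X.shape[1]
    M, _ = torch.linalg.qr(torch.randn(C, C))   # square orthogonal matrix
    M = M[:b]                                   # (b, C): b hash directions
    return torch.argmax(M @ X.T, dim=0)         # (N,) bucket id per position

class LearnableScore(nn.Module):
    """Eq. (10): S_l = W2 sigma(W1 L[lambda_i] + b1) + b2, sigma = ReLU."""
    def __init__(self, c, hidden=64):
        super().__init__()
        self.w1 = nn.Linear(c, hidden)
        self.w2 = nn.Linear(hidden, 1)

    def forward(self, l_bucket):                # (n, C) features in a bucket
        return self.w2(torch.relu(self.w1(l_bucket)))  # learnable offsets

# Eq. (9): total score = fixed dot-product score S_f + learnable score S_l,
# normalized within each bucket before aggregating the V vectors.
```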

Figure 4

Details of the global learnable attention16 block.

Table 1 The quantitative results on benchmarks with Gaussian8 kernels.
Table 2 The quantitative comparison on benchmarks with Gaussian8 kernels and various noise levels.

Loss function

Our model includes the kernel estimation task and the reconstruction task. We jointly optimize the model using the \(L_1\) loss \(L_{kernel}\) and the Charbonnier loss \(L_{pixel}\), as shown in Eq. (11).

$$\begin{aligned} L_{total} = L_{kernel}+L_{pixel}, \end{aligned}$$
(11)

where \(L_{kernel}=\Vert k-k_l\Vert _1\) is the \(L_1\) loss between the estimated kernel k and the ground-truth blur kernel \(k_l\). The pixel loss is defined as \(L_{pixel}=\sqrt{(I_{SR}-I_{HR})^2+\epsilon }\), where \(I_{SR}\) and \(I_{HR}\) denote the super-resolved image and the ground-truth HR image, and \(\epsilon\) is a constant, usually \(1\times 10^{-6}\).
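Eq. (11) translates directly into code; a minimal sketch, where the equal weighting of the two terms follows the equation and the reduction to means is our assumption:

```python
import torch

def total_loss(k_pred, k_gt, i_sr, i_hr, eps=1e-6):
    """Eq. (11): L1 kernel loss plus Charbonnier pixel loss (sketch)."""
    l_kernel = (k_pred - k_gt).abs().mean()                # L1 on kernels
    l_pixel = torch.sqrt((i_sr - i_hr) ** 2 + eps).mean()  # Charbonnier
    return l_kernel + l_pixel
```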

Experiments

Datasets and implementation details

Datasets and metrics

Following previous work1,2,5, we used DIV2K50 (800 images) and Flickr2K51 (2650 images) as training data, which together contain 3450 2K HR images. We adopt both isotropic and anisotropic Gaussian kernels as the assumed degradation to synthesize corresponding LR images according to Eq. (1). The experimental results are evaluated with the PSNR and SSIM52 fidelity metrics, both calculated on the Y channel of the YCbCr color space.
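For reference, a sketch of Y-channel PSNR using the standard ITU-R BT.601 conversion; we assume the paper follows the common MATLAB-style convention for RGB-to-Y conversion.

```python
import numpy as np

def psnr_y(sr, hr):
    """PSNR on the Y channel of YCbCr, for uint8 RGB arrays (H, W, 3)."""
    def to_y(img):
        r, g, b = [img[..., i].astype(np.float64) / 255.0 for i in range(3)]
        return 16.0 + 65.481 * r + 128.553 * g + 24.966 * b  # BT.601
    mse = np.mean((to_y(sr) - to_y(hr)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```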

Figure 5

The visual results of sig1.8_img093, sig2.4_img024, and sig3.0_img073 in Urban10046 and sig3.2_YouchienBoueigumi in Manga10947.

Isotropic Gaussian kernels

In setting 1, isotropic Gaussian kernels are applied, as in1,2,3,5. The kernel size is fixed to \(21\times 21\) during both the training and testing phases. During training, we uniformly sampled the kernel width from the ranges [0.2, 2.0], [0.2, 3.0], and [0.2, 4.0] for scale factors of 2, 3, and 4, respectively. During testing, we used Gaussian8 kernels to degrade five benchmarks: Set543, Set1444, B10045, Urban10046, and Manga10947. Gaussian8 uniformly selects 8 kernel widths from the ranges [0.80, 1.60], [1.35, 2.40], and [1.80, 3.20] for scale factors 2, 3, and 4, respectively. The HR images are then convolved with the 8 blur kernels and downsampled to obtain the corresponding LR images.
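For clarity, a sketch of the Gaussian8 width selection and the isotropic kernel construction; the function names and the linspace reading of "uniformly selects" are ours.

```python
import numpy as np

def gaussian8_widths(scale):
    """Eight kernel widths uniformly spanning the test range per scale."""
    lo, hi = {2: (0.80, 1.60), 3: (1.35, 2.40), 4: (1.80, 3.20)}[scale]
    return np.linspace(lo, hi, 8)

def isotropic_gaussian(width, size=21):
    """Normalized isotropic Gaussian kernel with standard deviation width."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * width ** 2))
    return k / k.sum()
```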

Anisotropic Gaussian kernels

In setting 2, anisotropic Gaussian kernels were employed, following the work in1,2,3,5,9. The kernel sizes are \(11\times 11\) and \(31\times 31\) for scale factors 2 and 4, respectively, in the training stage. During training, we uniformly sampled the kernel width from the range [0.6, 5] and the rotation angle from the range \([-\pi , \pi ]\). During testing, the blind SR benchmark DIV2KRK9 was used for evaluation.

Implementation details

We cropped the training data into sub-images of size \(480\times 480\) and fed LR patches of size \(64\times 64\) into our model. Our SR network consists of 6 TPAGs, each comprising 11 TPABs and 1 GTFB. We trained the model on 8 RTX 2070 GPUs with a batch size of 4 per GPU. The initial learning rate was \(1\times 10^{-4}\) and was halved every \(2\times 10^{5}\) iterations; the total number of iterations was \(1\times 10^{6}\). We used the Charbonnier loss21 as the loss function and the Adam53 optimizer with \(\beta _1 = 0.9\) and \(\beta _2 = 0.99\). We also adopted horizontal flipping and \(90^{\circ }\) rotation as data augmentation strategies during training.
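The optimizer and step schedule above translate directly into PyTorch; the stand-in model below is purely illustrative.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.99))
# halve the learning rate every 2e5 iterations over 1e6 total iterations
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=200_000, gamma=0.5)
```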

Comparison with state-of-the-art methods

Evaluation with isotropic Gaussian kernels

We evaluated our method on benchmarks synthesized with Gaussian8 kernels and compared its performance with state-of-the-art blind SR methods, including ZSSR13, IKC1, DANv15, DANv22, AdaTarget14, KOALAnet32, and DCLS3. Additionally, CARN48, a lightweight non-blind SR model, combined with a blind deblurring49 method, was also included for comparison.

The quantitative comparisons on benchmarks with Gaussian8 kernels are shown in Table 1. Our method achieves remarkable results on various benchmarks, particularly on datasets with strong self-similarity such as Urban10046 and Manga10947, where it is nearly 0.16 dB and 0.15 dB higher than DCLS3 at the \(\times\)4 scale factor. Bicubic interpolation and CARN48 are non-blind SR methods that assume a known bicubic degradation, which deviates from the actual situation and results in a severe drop in performance. ZSSR13 utilizes the internal statistics of patch recurrence to build an image-specific super-resolution method that does not require external datasets; it improves performance only slightly because it lacks abundant training data and powerful fitting ability. Performing the blind deblurring49 operation on the reconstructed image can moderately improve performance by reducing artifacts caused by the domain gap. Conversely, applying the inverse operation may further damage details in the LR image, leading to unsatisfactory SR results. IKC1 and DAN5 compensate for the offset caused by kernel estimation through iterative correction and end-to-end alternate optimization, respectively, significantly improving performance. DCLS3 retains the spatial information of the blur kernel while introducing dynamic convolution to boost the robustness of estimation, thus achieving superior performance.

Our proposed TPAB compensates for the attenuation of high-frequency components caused by the DCLS3 deconvolution module, and the GTFB integrates non-local features with low similarity scores to assist in the fusion of local and global features. The qualitative visual results in Fig. 5 also demonstrate that our method is capable of recovering sharp edges and rich details. Furthermore, considering the complexity of actual degradation, we conducted an extra experiment on images degraded with Gaussian8 kernels and additional noise. The quantitative results, shown in Table 2, validate that our method also has a certain degree of robustness to additional noise.

Table 3 The quantitative results on DIV2KRK benchmark with isotropic Gaussian kernel.

Table 3 shows the quantitative results of these methods on the DIV2KRK9 dataset. The results indicate that ZSSR13 can serve as a method for improving bicubic interpolation performance, and when combined with the kernel estimated by KernelGAN9 as a prior, the performance of ZSSR13 is further improved. SRMD4 performs on par with bicubic interpolation. Classical SR methods such as RCAN8, EDSR7, and DBPN54, which adopt paired training data degraded by bicubic downsampling, suffer a severe decrease in performance due to the domain gap. The correction filter55 modifies the blurry image to match the bicubic kernel, significantly improving the performance of DBPN54 trained on the bicubic kernel.

Among the remaining blind SR methods, which include IKC1, DAN2,5, KOALAnet32, AdaTarget14, and DCLS3, our method performs slightly better than DCLS3. This circumstance is consistent with our hypothesis. Owing to the wild degradation of the DIV2KRK9 dataset, textures and edges are severely damaged: the compensation of the TPAB module for high-frequency features is limited, and the GTFB cannot accurately adjust the similarity scores of local textures, so the reconstruction of high-frequency information is not as good as with the mildly degraded isotropic Gaussian kernels.

Ablation study and discussion

Table 4 The details of ablation study.
Table 5 The ablation study on benchmarks with Gaussian8 kernels.

In this subsection, we perform a series of ablation experiments on our two crucial modules, TPAB and GTFB, to quantitatively study their contributions. The specific settings of the ablation experiments are shown in Table 4.

First, DCLS3 takes the clean feature \(f_c\) and the original feature \(f_o\) as inputs to Double Path Attention Groups (DPAG) to reconstruct HR images. This DCLS configuration was used as the baseline to explore the contributions of our proposed modules, TPAB and GTFB.

Second, we replaced DPAG with our proposed TPAG, where the original feature \(f_o\) is split into \(\overline{f_o}\) and \(\widehat{f_o}\) to extract channel and spatial local features and compensate for the high-frequency decline. In this setting, to remove the effect of global feature fusion, the single GTFB was replaced by a TPAB. It can be observed from Table 5 that adding only the TPAB module results in a minimal improvement in performance (+0.02 dB on Set1444 and +0.01 dB on Manga10947). This may be because the depth of TPAG is already sufficient for extracting degradation features, and using TPAB alone to capture local texture features has a limited compensatory effect on high-frequency information.

Lastly, to evaluate the contribution of GTFB, we utilized a variant network consisting of Double Path Attention Blocks (DPAB) and global texture fusion blocks, appending a GTFB to each DPAG. The results show a trend similar to the previous experiments, indicating that GTFB can better utilize non-local textures to reconstruct high-frequency details. However, without the small compensation provided by the TPAB module, there is only a moderate performance improvement (about +0.05 dB on Urban10046), and the ability to reconstruct texture information remains insufficient.

Performance on real degradation

To further demonstrate the effectiveness of our method, we applied the model trained with isotropic Gaussian kernels and an additional noise level of 15 to real degraded images, where the degradation is complicated and unknown. Our model was compared with classical real-world super-resolution methods, including RealSR10, BSRGAN11, Real-ESRGAN12, DASR31, and MM-RealSR56, on the Real2011 dataset. An example of super-resolving a chip image is shown in Fig. 6. Our method still produces rich details and sharp edges.

Discussion

The specific results of the ablation experiments are shown in Table 5. It is evident that adding either module alone yields only a marginal performance gain (approximately +0.05 dB on Set1444 and BSD10045). However, the combination of the two modules achieves a markedly higher performance (+0.16 dB on Urban10046 and +0.13 dB on Manga10947, respectively, compared with using only one module). One possible reason is that even slight compensation of high-frequency information is crucial for the adaptive adjustment of similarity scores in the global learnable attention16 block. With the aggregation of local features along both channel and spatial dimensions introduced by the TPAG module, the GTFB exhibits a stronger ability to fuse global information.

Limitation

Our model has achieved good results in super-resolving images with both synthetic and real-world degradation. However, since our training data only cover blurring and noise, without considering more severe and complicated degradation, our model's performance is not satisfactory when facing images with wild degradation. Meanwhile, owing to the dependence on predicting specific kernel parameters, the accuracy of kernel estimation still has a moderate impact on the reconstructed image. We also conducted a comparison of running time and model size with state-of-the-art methods, and the results are shown in Table 6. The global information modeling performed by the GLA16 module increases the computational cost, and the channel split strategy increases memory access cost, which is a significant factor affecting inference speed.

Figure 6

Comparison on a real-world chip image from the Real2011 dataset for \(\times\)4 SR. The compared methods include RealSR10, BSRGAN11, Real-ESRGAN12, DASR31, and MM-RealSR56.

Conclusion

In this work, we propose a blind SR network that is capable of combining kernel estimation with structural prior knowledge. Our method consists of two steps: degradation representation embedding and texture details recovery. A triple path attention block is first proposed to extract local spatial and channel features to compensate for the loss of high-frequency components caused by the first step.

Subsequently, the global texture fusion block is used to fuse local and global textures, thus providing complementary information for the recovery of HR images. A series of experiments on benchmarks with different degradation settings demonstrates that our method achieves outstanding performance in blind SR. In future work, we have two main tasks: first, we will utilize contrastive learning to predict the degradation representation of images, distinguishing different types and levels of degradation rather than predicting specific kernel parameters; second, we will explore more practical degradation models to further generalize our model to real-world images.

Table 6 The comparison of complexity of different models. The inference latency is tested on an RTX 3090 GPU.