Introduction

The task of image super-resolution (SR) is to reconstruct clear high-resolution images from low-resolution ones. SR is often regarded as the inverse problem of image degradation, the forward process that mathematically models how image quality deteriorates. Following previous works1,2,3,4,5, the degradation pipeline is typically modeled as Eq. (1).

$$\begin{aligned} y = (x *k_{h})_{\downarrow _{s}}+n, \end{aligned}$$
(1)

where x represents the high-resolution (HR) image and y corresponds to the low-resolution (LR) image. The operator \(*\) denotes the two-dimensional convolution operation, \(k_{h}\) is the Gaussian kernel, \(\downarrow _{s}\) denotes the downsampling operation with a scale factor of s, and n refers to additive white Gaussian noise (AWGN). Classical SR methods6,7,8 assume that the degradation pipeline is a single bicubic downsampling. However, if the predefined degradation does not exactly match the practical situation, the reconstructed HR image may exhibit unpleasant artifacts1. Therefore, recovering sharp edges and rich details from LR images with unknown degradation1,2,5,9,10,11,12 is an extremely meaningful and challenging task.
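To make the degradation pipeline concrete, the following is a minimal PyTorch sketch of Eq. (1). The function name and the direct-downsampling choice are our illustrative assumptions, since Eq. (1) does not fix a particular downsampler; an odd kernel size is assumed.

```python
import torch
import torch.nn.functional as F

def degrade(x, k, s, sigma_n):
    """Synthesize y = (x * k_h)downarrow_s + n as in Eq. (1).

    x:       HR image, shape (1, C, H, W)
    k:       Gaussian blur kernel, shape (kh, kw), kh and kw odd
    s:       integer downsampling scale factor
    sigma_n: standard deviation of the AWGN term n
    """
    c = x.shape[1]
    # apply the same kernel to every channel via a depthwise convolution
    weight = k.expand(c, 1, *k.shape)
    y = F.conv2d(x, weight, padding=k.shape[-1] // 2, groups=c)
    y = y[..., ::s, ::s]                       # s-fold direct downsampling
    return y + sigma_n * torch.randn_like(y)   # additive white Gaussian noise
```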

The most common blind SR schemes are typically divided into two stages: the first stage models the kernel explicitly or implicitly by optimizing a deep neural network on the degraded image1,2,3,4,5,9, and the second stage feeds the LR image, together with the additional degradation prior, into the SR network to obtain the reconstructed HR image. In the first stage, a mismatch between the estimated blur kernel and the actual one can lead to over-smoothed or over-sharpened results1,2,3. A viable solution is to estimate the kernel accurately1,9 and integrate it robustly with the SR backbone2,3,5.

Recent research1,2,3,4,5,9 has mainly concentrated on the first stage of kernel modeling. DCLS3 proposes a robust dynamic kernel estimation network and introduces a module for degradation representation embedding. However, its SR network has limited ability to represent spatial features, making it difficult to recover structural information well. Figure 1 shows the reconstruction results of state-of-the-art methods and of our method on structural textures. It can be clearly observed that current methods lack structural prior knowledge, leaving ambiguous details and edges in the recovered SR images.

Figure 1

Blind super-resolution of Img100 from DIV2KRK9, for scale factor 4. Based on the fusion of local and global features, our method is effective in restoring sharp and clean edges, and outperforms previous state-of-the-art approaches such as ZSSR13, IKC1, AdaTarget14, DANv22, and DCLS3.

It is broadly recognized that non-local operations15,16, which introduce self-similarity priors, are significant for recovering recurring textures within the same image. Moreover, spatial attention and channel attention mechanisms can effectively capture local features. Motivated by these observations, we propose a network that combines kernel estimation with structural prior knowledge and can leverage both local spatial and global features to boost reconstruction performance for images with high self-similarity. To be specific, we employ the deep constrained least squares3 (DCLS) block to deblur the original feature \(f_o\) and obtain a clean feature \(f_c\). Next, we split the original feature \(f_o\) into two vectors along the channel dimension: \(\widehat{f_o}\) and \(\overline{f_o}\). These three vectors, \(f_c\), \(\overline{f_o}\), and \(\widehat{f_o}\), are fed together into a series of triple path attention blocks (TPAB) to perform deep feature extraction and exploit local spatial information to compensate for the gap caused by kernel estimation. Furthermore, the global texture fusion block (GTFB) adaptively adjusts the self-similarity scores of non-local features to achieve the embedding of global structural priors. We have performed several standard experiments on benchmarks with various degradation settings to evaluate our proposed method. The quantitative and qualitative results demonstrate that our network delivers excellent performance on all datasets, particularly for images with rich structural information. The main contributions of this paper are summarized as follows:

  • We propose a blind SR network capable of combining kernel estimation with structural prior knowledge to reconstruct textures with high self-similarity.

  • We employ a channel split strategy to take advantage of the original local spatial and channel features in order to compensate for artifacts generated by the kernel estimation and the deblurring operation.

  • We design a global texture fusion block that aggregates local spatial features with non-local operations to enhance recovery performance in images with high self-similarity.

  • Extensive experiments with various degradation settings demonstrate that our method achieves outstanding performance in the task of blind SR.

Related work

SR of bicubic and multiple degradation

The pioneering work of SRCNN6 successfully motivated interest among researchers in the field of SR. Inspired by hierarchical architectures7,8,17 and robust loss functions11,12,18,19,20,21, CNN-based methods have achieved outstanding performance on the predefined bicubic downsampling SR task, while the degradation process in the real world is generally unknown and complicated11,12. In practical applications, if the bicubic kernel assumed by classical methods does not match the actual degradation kernel, the reconstructed SR image will contain unpleasant artifacts that severely affect visual perception quality. This discrepancy between the assumed kernel and the actual kernel gives rise to a domain gap22,23,24, which is a challenge in practical applications of SR.

Another line of non-blind SR methods4,25,26,27,28 is designed to super-resolve multiple types of degraded images with corresponding kernels. These methods make classical SR networks more robust and applicable to a wider range of real-world scenarios. FFDNet25 utilizes a noise level map as additional input, allowing it to handle images affected by various types of noise. Similarly, SRMD4 proposes a kernel stretching strategy that incorporates the two degradation parameters, the blur kernel k and the noise level n, together with the LR image as input to the SR network. Zhang et al.29 combine learning-based methods with model-based methods to design an end-to-end unfolding network that can handle various types of degraded images at different scales. UDVD27 introduces dynamic convolution into the kernel estimation network, where the filter parameters are dynamically adjusted according to the input degraded image. KMSR26 utilizes generative adversarial networks to learn the distribution of kernels in real degraded images. Inspired by KMSR26, Son et al.28 propose an adaptive downsampling model that employs an unsupervised approach to simulate the actual degradation process of real-world images. They then synthesize paired data and develop an SR network capable of handling various types of degradation.

SR of unknown kernel

The most common approach for the blind SR task is based on kernel estimation methods1,2,3,4,5,9,30. KernelGAN9 utilizes cross-scale image similarity to accomplish kernel estimation on specific images and combines it with a classical method13 to achieve blind reconstruction. MANet30 further investigates spatially variant blur kernels in order to super-resolve object motion and out-of-focus blur in real-world scenarios. Gu et al.1 use an iterative correction method to alleviate the effects caused by the mismatch between the estimated and actual kernels. Luo et al.2,5 adopt an end-to-end network to alternately optimize the estimator and the restorer. These two methods1,2 are effective but time-consuming owing to their elaborate optimization steps. DCLS3 reformulates a practical degradation model and proposes a deep constrained least squares module that performs deconvolution to achieve robust degradation awareness. The aforementioned methods1,2,3,5,9,22,23 concentrate on modeling degradation either implicitly22,23,31 or explicitly1,2,3,4,5,9,10,32 without delving into the role of structural textures as prior knowledge. This may be a potential factor limiting the upper bound of blind SR performance.

Figure 2

The overall architecture of our network and the structure of related blocks. Given an LR image, we first estimate the kernel k and feed it into the DCLS module to achieve degradation representation embedding. The triple path attention groups take the clean feature \(f_c\) and the chunked original features \(\overline{f_o}\) and \(\widehat{f_o}\) as input to restore the clean SR image.

Method

Architecture

In this subsection, we introduce the overall architecture of our model. As shown in Fig. 2, our method mainly contains two stages: degradation representation embedding and texture details recovery. The first stage includes the dynamic kernel estimation and deblurring operation based on the DCLS3 module. The estimator \(N_e\) accomplishes robust kernel estimation from the degraded LR image. Next, the LR image and the estimated blur kernel k are jointly input into the DCLS module for deblurring. Lastly, the clean and original shallow features are fed into the triple path attention network, which consists of triple path attention blocks (TPAB) and global texture fusion blocks (GTFB), to achieve local and global feature fusion. Details of the pipeline and the relevant blocks are described in the following subsections.

Degradation representation embedding

Figure 3

The overall architecture of dynamic kernel estimation. Given an LR image as input, the network first generates four specific filters. These filters are then convolved sequentially with an identity kernel \(I_k\) to produce a single kernel k whose receptive field matches the size of the predicted kernel.

Inspired by the work of DCLS3, our method employs dynamic kernel estimation, as shown in Fig. 3. Given an LR image with unknown degradation as input, three residual blocks are applied to extract deep features \(f_s\), followed by global average pooling to obtain the flattened features \(\overline{f_s}\). A fully connected layer maps the specific degradation information to four filters, \(\widehat{h_0}\), \(\widehat{h_1}\), \(\widehat{h_2}\), and \(\widehat{h_3}\), with kernel sizes of \(11\times 11\), \(7\times 7\), \(5\times 5\), and \(1\times 1\), respectively, so that the receptive field is consistent with the size of the predicted kernel k. The process of dynamic estimation is shown in Eq. (2).

$$\begin{aligned} k = I_k*\widehat{h_0}*\widehat{h_1}*\widehat{h_2}*\widehat{h_3}, \end{aligned}$$
(2)

where \(I_k\) is the identity kernel, \(\widehat{h_0}\), \(\widehat{h_1}\), \(\widehat{h_2}\), and \(\widehat{h_3}\) are the specific filters mapped from degradation information, and k is the kernel estimated by the estimator \(N_e\). \(I_k\) is sequentially convolved with these filters, enabling the parameters in the network \(N_e\) to vary with different degraded inputs. Meanwhile, the DCLS3 module utilizes deconvolution operations to obtain the clean feature as in Eq. (3).

$$\begin{aligned} f_c = DCLS_{deconvolve}(f_{o},k), \end{aligned}$$
(3)

where \(f_{o}\) represents the blurry original features extracted from the LR image by a \(3\times 3\) convolution layer and three residual blocks, k is the kernel predicted by the network \(N_e\), and \(f_{c}\) represents the clean features deblurred by the deconvolution operation of the DCLS3 module.
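As a concrete illustration of Eq. (2), the sketch below composes the four predicted filters with an identity kernel using full convolutions. The function name is ours; note also that PyTorch's conv2d is cross-correlation, which is equivalent here up to a kernel flip.

```python
import torch
import torch.nn.functional as F

def compose_kernel(filters):
    """Eq. (2): k = I_k * h0 * h1 * h2 * h3.

    filters: predicted filters of shape (1, 1, s, s) with
             s = 11, 7, 5, 1, as produced by the estimator N_e.
    """
    k = torch.zeros(1, 1, 1, 1)
    k[..., 0, 0] = 1.0                 # identity (delta) kernel I_k
    for h in filters:
        s = h.shape[-1]
        # padding = s - 1 turns conv2d into a 'full' convolution,
        # so the kernel grows: 1 -> 11 -> 17 -> 21 -> 21
        k = F.conv2d(k, h, padding=s - 1)
    return k                           # final 21x21 estimated kernel
```

The receptive field of the composed kernel (\(21\times 21\)) matches the kernel size fixed during training and testing in the isotropic setting described later.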

Texture details recovery

Even after introducing the deconvolution operation through the DCLS3 module, the damaged high-frequency information cannot be fully restored. Therefore, we propose a novel network that not only strongly extracts local features to compensate for the decline of high-frequency components but also incorporates non-local15,16 operations to fuse local and global features.

Figure 2 illustrates the proposed SR network, which mainly consists of the extraction of original features and the fusion of local features with global features. A \(3\times 3\) convolutional layer and three residual blocks without batch normalization33 are used to extract the original features \(f_o\) as in Eq. (4).

$$\begin{aligned} f_o = h_{Resblock}(h_{conv}(I_{LR})), \end{aligned}$$
(4)

where \(I_{LR}\in {R^{H\times W\times C}}\) is the input LR image, H and W represent the height and width of the patch cropped from a sub-image, and C is the number of RGB channels.

In the previous stage we obtained the clean feature \(f_c\). FAIG34 demonstrates that a one-branch network without a degradation prior can achieve performance comparable to a two-branch method with degradation information. Although it may seem reasonable to directly use the clean feature \(f_c\) as input to the SR network for recovery, the offset of kernel estimation9,30 and the insufficiency of the deblurring function in the DCLS3 module would prevent the SR backbone from effectively restoring highly structured textures. Therefore, we propose a Triple Path Attention Group (TPAG) to extract the deep feature f as in Eqs. (5) and (6).

$$\begin{aligned} \psi (f_c,\overline{f_o},\widehat{f_o})&= h_{GTFB}(h^n_{TPAB}(f_c,\overline{f_o},\widehat{f_o})), \end{aligned}$$
(5)
$$\begin{aligned} f&=\psi _N(\psi _{N-1}(\cdots \psi _2(\psi _1(f_c,\overline{f_o},\widehat{f_o})))), \end{aligned}$$
(6)

where \(\psi (f_c,\overline{f_o},\widehat{f_o})\) represents a TPAG that adopts the clean feature \(f_c\) and the chunked original features \(\overline{f_o}\) and \(\widehat{f_o}\) as additional inputs, and \(h_{GTFB}(h^n_{TPAB})\) means that each group is composed of n Triple Path Attention Blocks (TPAB) followed by one Global Texture Fusion Block (GTFB). f is the deep clean feature, and N is the number of TPAGs in our SR network.
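The following PyTorch sketch mirrors Eqs. (5) and (6) with stand-in blocks: TPAB and GTFB are stubbed as plain convolutions here (their actual structure is described in the next subsections), and re-injecting the chunked original features at every group is our reading of Fig. 2.

```python
import torch
import torch.nn as nn

class TPAG(nn.Module):
    """Eq. (5): n TPABs followed by one GTFB (both stubbed here)."""
    def __init__(self, c, n_tpab=11):
        super().__init__()
        # stand-in TPAB: fuses the running feature (c channels) with the
        # two chunked original features (c // 2 channels each)
        self.tpabs = nn.ModuleList(
            [nn.Conv2d(2 * c, c, 3, padding=1) for _ in range(n_tpab)]
        )
        self.gtfb = nn.Conv2d(c, c, 3, padding=1)  # stand-in GTFB

    def forward(self, f, f_bar, f_hat):
        for tpab in self.tpabs:
            f = f + tpab(torch.cat([f, f_bar, f_hat], dim=1))
        return self.gtfb(f)

# Eq. (6): N cascaded groups starting from the clean feature f_c
def deep_feature(groups, f_c, f_bar, f_hat):
    f = f_c
    for g in groups:
        f = g(f, f_bar, f_hat)
    return f
```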

In addition, we further refine the deep feature f through a \(3\times 3\) convolutional layer, with the original low-frequency feature \(f_{o}\) added through a long skip connection7,8,35,36, as in Eq. (7).

$$\begin{aligned} I_{SR} = h_{upsample}(h_{conv}(f)+f_{o}). \end{aligned}$$
(7)

Finally, pixel shuffle37 serves as the upsampling module and completes the mapping from feature maps to the HR image \(I_{SR}\).
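Putting Eq. (7) and the pixel-shuffle upsampler together gives the minimal sketch below. The single \(\times\)4 shuffle and the channel widths are our assumptions; implementations often stack two \(\times\)2 stages instead.

```python
import torch.nn as nn

class ReconstructionHead(nn.Module):
    """Eq. (7): I_SR = h_upsample(h_conv(f) + f_o), with pixel shuffle37."""
    def __init__(self, c=64, scale=4, out_ch=3):
        super().__init__()
        self.refine = nn.Conv2d(c, c, 3, padding=1)       # h_conv
        self.upsample = nn.Sequential(                    # h_upsample
            nn.Conv2d(c, c * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(c, out_ch, 3, padding=1),
        )

    def forward(self, f, f_o):
        return self.upsample(self.refine(f) + f_o)        # long skip from f_o
```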

Triple path attention block

Deep SR networks contain specific filters that can handle various types and levels of degraded images34. These specific filters, which can be used to address corresponding degradation such as noise and blur, are located at different positions and branches within a single SR network. Channel attention8,36,38,39 and spatial attention40,41 mechanisms can enhance the local modeling ability. Therefore, we introduce these mechanisms as two branches in TPAB, allowing the network to strengthen its generalization and better handle different types of degradation.

The triple path attention block, consisting of residual channel attention and residual local spatial branches, is shown in Fig. 2. The original shallow features \(f_{o}\) are split into two feature maps, \(\overline{f_o}\) and \(\widehat{f_o}\), along the channel dimension. They are combined with the deblurred clean features \({f_c}\) and passed through TPABs to refine local texture features and compensate for the loss of high-frequency texture details. Specifically, \(\overline{f_o}\) and \(\widehat{f_o}\) are processed by residual channel attention branches8 and residual local spatial branches41, respectively, to extract deep local features. Meanwhile, \(\overline{f_o}\) and \(\widehat{f_o}\) are concatenated with \(f_o\) and fused by a convolutional layer. Lastly, the aggregated local features pass through a GTFB to establish connections between local and non-local features.
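A sketch of the channel split and the two attention branches described above follows. The exact layer widths and the fusion layout are our assumptions based on the cited channel8 and spatial41 attention designs.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention8 (sketch)."""
    def __init__(self, c, r=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, max(c // r, 1), 1), nn.ReLU(inplace=True),
            nn.Conv2d(max(c // r, 1), c, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)

class SpatialAttention(nn.Module):
    """Spatial attention over pooled channel statistics41 (sketch)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        stats = torch.cat([x.mean(1, keepdim=True),
                           x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(stats))

class TPABSketch(nn.Module):
    """The three paths: attention branches on the two chunks of f_o,
    concatenated with f_o and fused by a conv, then added to f_c."""
    def __init__(self, c):
        super().__init__()
        half = c // 2
        self.ca = nn.Sequential(nn.Conv2d(half, half, 3, padding=1),
                                ChannelAttention(half))
        self.sa = nn.Sequential(nn.Conv2d(half, half, 3, padding=1),
                                SpatialAttention())
        self.fuse = nn.Conv2d(2 * c, c, 1)

    def forward(self, f_c, f_o):
        f_bar, f_hat = f_o.chunk(2, dim=1)   # split along the channel dim
        local = torch.cat([self.ca(f_bar), self.sa(f_hat), f_o], dim=1)
        return f_c + self.fuse(local)        # residual fusion with f_c
```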

Global texture fusion block

Non-local15,16,42 operations are capable of capturing long-range dependencies between different parts of an image, addressing the limitation of the receptive field by introducing self-attention mechanisms that enable each position to attend to all other positions in the input. This operation is particularly instrumental in restoring structural textures that exhibit strong self-similarity. Previous researchers15,42 hypothesized that non-local textures with higher similarity scores would be more advantageous for restoring edge information. However, they overlooked the fact that when an image suffers from severe degradation, non-local textures with low similarity scores may actually be more useful for restoring edges16.

Fusing the local spatial texture features without careful consideration does not significantly improve the network's ability to restore textures. Therefore, we cascade a global texture fusion block (GTFB) at the end of each TPAG. In this module, we adopt the global learnable attention block16 after the local feature fusion. The global learnable attention block adaptively adjusts the similarity scores of non-local textures, allowing the network to effectively utilize non-local textures that previously had low similarity scores but can provide rich details.

As shown in Fig. 4, we take the feature map \(X\in R^{H\times W \times C}\) as input and convert X into three vectors Q, L, and \(V\in R^{C\times HW }\) to implement the global attention mechanism. Super-Bit Locality-Sensitive Hashing (SB-LSH) divides the feature map into buckets to reduce computational cost, as shown in Eq. (8).

$$\begin{aligned} \lambda _i = \left\{ x_j|argmax(MX_i)=argmax(MX_j) \right\} , \end{aligned}$$
(8)

where \(M \in R^{b\times C}\) is a randomly initialized orthogonal matrix and b is the number of hash buckets, \(X_i\in R^C\) is the \(i\)-th component of Q, and \(\lambda _i\) is the index set corresponding to \(Q_i\). Next, we use the learnable similarity score \(S_l\) (LSS) and the fixed dot-product similarity score \(S_f\) (DPSS) to measure self-similarity as in Eq. (9).

$$\begin{aligned} S(X_i) = S_f(X_i)+S_l(X_i), \end{aligned}$$
(9)

where \(S_f(X_i)=X^T_i X_i\), \(S_l(X_i)\) is defined as Eq. (10).

$$\begin{aligned} S_l(X_i) = (W_2\sigma (W_1L[\lambda _i]+b_1)+b_2) , \end{aligned}$$
(10)

where \(\sigma\) is the ReLU activation and \(W_1, W_2, b_1, b_2\) are learnable parameters.
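A sketch of the bucket assignment (Eq. 8) and the learnable score (Eq. 10) follows. The MLP hidden width and the orthogonalization via QR are our assumptions; the function and class names are ours.

```python
import torch
import torch.nn as nn

def sb_lsh_buckets(X, b):
    """Eq. (8): Super-Bit LSH bucket assignment (sketch).

    X: (N, C) flattened spatial features; positions whose projections onto
    the random orthogonal matrix M share the same argmax fall in one bucket.
    Requires b <= C.
    """
    C = X.shape[1]
    M, _ = torch.linalg.qr(torch.randn(C, C))   # square orthogonal matrix
    M = M[:b]                                   # (b, C): b hash directions
    return torch.argmax(M @ X.T, dim=0)         # (N,) bucket id per position

class LearnableScore(nn.Module):
    """Eq. (10): S_l = W2 sigma(W1 L[lambda_i] + b1) + b2, sigma = ReLU."""
    def __init__(self, c, hidden=64):
        super().__init__()
        self.w1 = nn.Linear(c, hidden)
        self.w2 = nn.Linear(hidden, 1)

    def forward(self, l_bucket):                # (n, C) features in a bucket
        return self.w2(torch.relu(self.w1(l_bucket)))  # learnable offsets

# Eq. (9): total score = fixed dot-product score S_f + learnable score S_l,
# normalized within each bucket before aggregating the V vectors.
```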

Figure 4

Details of the global learnable attention16 block.

Table 1 The quantitative results on benchmarks with Gaussian8 kernels.
Table 2 The quantitative comparison on benchmarks with Gaussian8 kernels and various noise levels.

Loss function

Our model includes the kernel estimation task and the reconstruction task. We jointly optimize the model using the \(L_1\) loss \(L_{kernel}\) and the Charbonnier loss \(L_{pixel}\), as shown in Eq. (11).

$$\begin{aligned} L_{total} = L_{kernel}+L_{pixel}, \end{aligned}$$
(11)

where \(L_{kernel}=\Vert k-k_l\Vert _1\) is the \(L_1\) loss between the estimated kernel k and the ground-truth blur kernel \(k_l\). The pixel loss is defined as \(L_{pixel}=\sqrt{(I_{SR}-I_{HR})^2+\epsilon }\), where \(I_{SR}\) and \(I_{HR}\) denote the super-resolved image and the ground-truth HR image, and \(\epsilon\) is a constant, usually \(1\times 10^{-6}\).
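Eq. (11) translates directly into code; a minimal sketch, where the equal weighting of the two terms follows the equation and the reduction to means is our assumption:

```python
import torch

def total_loss(k_pred, k_gt, i_sr, i_hr, eps=1e-6):
    """Eq. (11): L1 kernel loss plus Charbonnier pixel loss (sketch)."""
    l_kernel = (k_pred - k_gt).abs().mean()                # L1 on kernels
    l_pixel = torch.sqrt((i_sr - i_hr) ** 2 + eps).mean()  # Charbonnier
    return l_kernel + l_pixel
```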

Experiments

Datasets and implementation details

Datasets and metrics

Following previous work1,2,5, we used DIV2K50 (800 images) and Flickr2K51 (2650 images) as training data, which together contain 3450 2K HR images. We adopt both isotropic and anisotropic Gaussian kernels as the assumed degradation to synthesize corresponding LR images according to Eq. (1). The experimental results are evaluated with the PSNR and SSIM52 fidelity metrics, both calculated on the Y channel of the YCbCr color space.
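For reference, a sketch of Y-channel PSNR using the standard ITU-R BT.601 conversion; we assume the paper follows the common MATLAB-style convention for RGB-to-Y conversion.

```python
import numpy as np

def psnr_y(sr, hr):
    """PSNR on the Y channel of YCbCr, for uint8 RGB arrays (H, W, 3)."""
    def to_y(img):
        r, g, b = [img[..., i].astype(np.float64) / 255.0 for i in range(3)]
        return 16.0 + 65.481 * r + 128.553 * g + 24.966 * b  # BT.601
    mse = np.mean((to_y(sr) - to_y(hr)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```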

Figure 5

The visual results of sig1.8_img093, sig2.4_img024, and sig3.0_img073 in Urban10046 and sig3.2_YouchienBoueigumi in Manga10947.

Isotropic Gaussian kernels

In setting 1, isotropic Gaussian kernels are applied, as in1,2,3,5. The kernel size is fixed to \(21\times 21\) during both the training and testing phases. During training, we uniformly sampled the kernel width from the ranges [0.2, 2.0], [0.2, 3.0], and [0.2, 4.0] for scale factors of 2, 3, and 4, respectively. During testing, we used Gaussian8 kernels to degrade five benchmarks: Set543, Set1444, B10045, Urban10046, and Manga10947. Gaussian8 uniformly selects 8 kernel widths from the ranges [0.80, 1.60], [1.35, 2.40], and [1.80, 3.20] for scale factors 2, 3, and 4, respectively. The HR images are then convolved with the 8 blur kernels and downsampled to obtain the corresponding LR images.
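For clarity, a sketch of the Gaussian8 width selection and the isotropic kernel construction; the function names and the linspace reading of "uniformly selects" are ours.

```python
import numpy as np

def gaussian8_widths(scale):
    """Eight kernel widths uniformly spanning the test range per scale."""
    lo, hi = {2: (0.80, 1.60), 3: (1.35, 2.40), 4: (1.80, 3.20)}[scale]
    return np.linspace(lo, hi, 8)

def isotropic_gaussian(width, size=21):
    """Normalized isotropic Gaussian kernel with standard deviation width."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * width ** 2))
    return k / k.sum()
```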

Anisotropic Gaussian kernels

In setting 2, anisotropic Gaussian kernels were employed, following the work in1,2,3,5,9. The kernel sizes are \(11\times 11\) and \(31\times 31\) for scale factors 2 and 4, respectively, in the training stage. During training, we uniformly sampled the kernel width from the range [0.6, 5] and the rotation angle from the range \([-\pi , \pi ]\). During testing, the blind SR benchmark DIV2KRK9 was used for evaluation.

Implementation details

We cropped the training data into sub-images of size \(480\times 480\) and fed LR patches of size \(64\times 64\) into our model. Our SR network consists of 6 TPAGs, each comprising 11 TPABs and 1 GTFB. We trained the model on 8 RTX 2070 GPUs with a batch size of 4 per GPU. The initial learning rate was \(1\times 10^{-4}\) and was halved every \(2\times 10^{5}\) iterations; the total number of iterations was \(1\times 10^{6}\). We used the Charbonnier loss21 as the loss function and the Adam53 optimizer with \(\beta _1 = 0.9\) and \(\beta _2 = 0.99\). We also adopted horizontal flipping and \(90^{\circ }\) rotation as data augmentation strategies during training.
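The optimizer and step schedule above translate directly into PyTorch; the stand-in model below is purely illustrative.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.99))
# halve the learning rate every 2e5 iterations over 1e6 total iterations
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=200_000, gamma=0.5)
```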

Comparison with state-of-the-art methods

Evaluation with isotropic Gaussian kernels

We evaluated our method on benchmarks synthesized with Gaussian8 kernels and compared its performance with state-of-the-art blind SR methods, including ZSSR13, IKC1, DANv15, DANv22, AdaTarget14, KOALAnet32, and DCLS3. Additionally, CARN48, a lightweight non-blind SR model, combined with a blind deblurring49 method, was also included for comparison.

The quantitative comparisons on benchmarks with Gaussian8 kernels are shown in Table 1. Our method achieves remarkable results on various benchmarks, particularly on datasets with strong self-similarity such as Urban10046 and Manga10947, where it is nearly 0.16 dB and 0.15 dB higher than DCLS3 at the \(\times\)4 scale factor. Bicubic interpolation and CARN48 are non-blind SR methods that assume a known bicubic degradation, which deviates from the actual situation and results in a severe drop in performance. ZSSR13 utilizes the internal statistics of patch recurrence to build an image-specific super-resolution method that does not require external datasets; it improves performance only slightly because it lacks abundant training data and powerful fitting ability. Performing the blind deblurring49 operation on the reconstructed image can moderately improve performance by reducing artifacts caused by the domain gap. Conversely, applying the inverse operation may further damage details in the LR image, leading to unsatisfactory SR results. IKC1 and DAN5 compensate for the offset caused by kernel estimation through iterative correction and end-to-end alternate optimization, respectively, significantly improving performance. DCLS3 retains the spatial information of the blur kernel while introducing dynamic convolution to boost the robustness of estimation, thus achieving superior performance.

Our proposed TPAB compensates for the attenuation of high-frequency components caused by the DCLS3 deconvolution module, and the GTFB integrates non-local features with low similarity scores to assist in the fusion of local and global features. The qualitative visual results in Fig. 5 also demonstrate that our method is capable of recovering sharp edges and rich details. Furthermore, considering the complexity of actual degradation, we conducted an extra experiment on images degraded with Gaussian8 kernels and additional noise. The quantitative results, shown in Table 2, validate that our method also has a certain degree of robustness to additional noise.

Table 3 The quantitative results on DIV2KRK benchmark with isotropic Gaussian kernel.

Table 3 shows the quantitative results of these methods on the DIV2KRK9 dataset. The results indicate that ZSSR13 can serve as a method for improving bicubic interpolation performance, and when combined with the kernel estimated by KernelGAN9 as a prior, the performance of ZSSR13 is further improved. SRMD4 performs on par with bicubic interpolation. Classical SR methods such as RCAN8, EDSR7, and DBPN54, which adopt paired training data degraded by bicubic downsampling, suffer a severe decrease in performance due to the domain gap. The correction filter55 modifies the blurry image to match the bicubic kernel, significantly improving the performance of DBPN54 trained on the bicubic kernel.

Among the remaining blind SR methods, which include IKC1, DAN2,5, KOALAnet32, AdaTarget14, and DCLS3, our method performs slightly better than DCLS3. This circumstance is consistent with our hypothesis. Owing to the wild degradation of the DIV2KRK9 dataset, textures and edges are severely damaged: the compensation of the TPAB module for high-frequency features is limited, and the GTFB cannot accurately adjust the similarity scores of local textures, so the reconstruction of high-frequency information is not as good as with the mildly degraded isotropic Gaussian kernels.

Ablation study and discussion

Table 4 The details of ablation study.
Table 5 The ablation study on benchmarks with Gaussian8 kernels.

In this subsection, we perform a series of ablation experiments on our two crucial modules, TPAB and GTFB, to quantitatively study their contributions. The specific settings of the ablation experiments are shown in Table 4.

First, DCLS3 takes the clean feature \(f_c\) and the original feature \(f_o\) as inputs to Double Path Attention Groups (DPAG) to reconstruct HR images. This DCLS configuration was used as the baseline to explore the contributions of our proposed modules, TPAB and GTFB.

Second, we replaced DPAG with our proposed TPAG, where the original feature \(f_o\) is split into \(\overline{f_o}\) and \(\widehat{f_o}\) to extract channel and spatial local features and compensate for the high-frequency decline. In this setting, to remove the effect of global feature fusion, the single GTFB was replaced by a TPAB. It can be observed from Table 5 that adding only the TPAB module results in a minimal improvement in performance (+0.02 dB on Set1444 and +0.01 dB on Manga10947). This may be because the depth of TPAG is already sufficient for extracting degradation features, and using TPAB alone to capture local texture features has a limited compensatory effect on high-frequency information.

Lastly, to evaluate the contribution of GTFB, we utilized a variant network consisting of Double Path Attention Blocks (DPAB) and global texture fusion blocks, appending a GTFB to each DPAG. The results show a trend similar to the previous experiments, indicating that GTFB can better utilize non-local textures to reconstruct high-frequency details. However, without the small compensation provided by the TPAB module, there is only a moderate performance improvement (about +0.05 dB on Urban10046), and the ability to reconstruct texture information remains insufficient.

Performance on real degradation

To further demonstrate the effectiveness of our method, we applied the model trained with isotropic Gaussian kernels and an additional noise level of 15 to real degraded images, where the degradation is complicated and unknown. Our model was compared with classical real-world super-resolution methods, including RealSR10, BSRGAN11, Real-ESRGAN12, DASR31, and MM-RealSR56, on the Real2011 dataset. An example of super-resolving a chip image is shown in Fig. 6. Our method still produces rich details and sharp edges.

Discussion

The specific results of the ablation experiments are shown in Table 5. It is evident that adding either module alone yields only a marginal performance gain (approximately +0.05 dB on Set1444 and BSD10045). However, the combination of the two modules achieves a markedly higher performance (+0.16 dB on Urban10046 and +0.13 dB on Manga10947, respectively, compared with using only one module). One possible reason is that even slight compensation of high-frequency information is crucial for the adaptive adjustment of similarity scores in the global learnable attention16 block. With the aggregation of local features along both channel and spatial dimensions introduced by the TPAG module, the GTFB exhibits a stronger ability to fuse global information.

Limitation

Our model has achieved good results in super-resolving images with both synthetic and real-world degradation. However, since our training data only cover blurring and noise, without considering more severe and complicated degradation, our model's performance is not satisfactory when facing images with wild degradation. Meanwhile, owing to the dependence on predicting specific kernel parameters, the accuracy of kernel estimation still has a moderate impact on the reconstructed image. We also conducted a comparison of running time and model size with state-of-the-art methods, and the results are shown in Table 6. The global information modeling performed by the GLA16 module increases the computational cost, and the channel split strategy increases memory access cost, which is a significant factor affecting inference speed.

Figure 6

Comparison on a real-world chip image from the Real2011 dataset for \(\times\)4 SR. The compared methods include RealSR10, BSRGAN11, Real-ESRGAN12, DASR31, and MM-RealSR56.

Conclusion

In this work, we propose a blind SR network that is capable of combining kernel estimation with structural prior knowledge. Our method consists of two steps: degradation representation embedding and texture details recovery. A triple path attention block is first proposed to extract local spatial and channel features to compensate for the loss of high-frequency components caused by the first step.

Subsequently, the global texture fusion block is used to fuse local and global textures, thus providing complementary information for the recovery of HR images. A series of experiments on benchmarks with different degradation settings demonstrates that our method achieves outstanding performance in blind SR. In future work, we have two main tasks: first, we will utilize contrastive learning to predict the degradation representation of images, distinguishing different types and levels of degradation rather than predicting specific kernel parameters; second, we will explore more practical degradation models to further generalize our model to real-world images.

Table 6 The comparison of complexity of different models. The inference latency is tested on an RTX 3090 GPU.