Yunpeng Qu1,2 | Kun Yuan2 | Kai Zhao2 | Qizhi Xie1,2 | Jinhua Hao2 | Ming Sun2 | Chao Zhou2
1Tsinghua University, 2Kuaishou Technology.
Diffusion-based methods, endowed with a formidable generative prior, have recently received increasing attention in Image Super-Resolution (ISR). However, because low-resolution (LR) images often undergo severe degradation, it is challenging for ISR models to perceive the semantic and degradation information, resulting in restored images with incorrect content or unrealistic artifacts. To address these issues, we propose a Cross-modal Priors for Super-Resolution (XPSR) framework. Within XPSR, cutting-edge Multimodal Large Language Models (MLLMs) are used to acquire precise and comprehensive semantic conditions for the diffusion model. To facilitate better fusion of cross-modal priors, a Semantic-Fusion Attention is proposed. To distill semantic-preserved information instead of undesired degradations, a Degradation-Free Constraint is attached between the LR image and its high-resolution (HR) counterpart. Quantitative and qualitative results show that XPSR is capable of generating high-fidelity and high-realism images across synthetic and real-world datasets.
## git clone this repository
git clone https://github.com/qyp2000/XPSR.git
cd XPSR
# create an environment with python >= 3.9
conda create -n xpsr python=3.9
conda activate xpsr
pip install -r requirements.txt
- Download the SD-v1.5 models from Hugging Face and put them into `checkpoints/stable-diffusion-v1-5`.
- Download the pretrained XPSR model from Google Drive and put it into `runs/xpsr`.
- Prepare testing images in `testset`.
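Before running inference, it can help to confirm that the three locations listed above are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_dirs` is hypothetical, and it only assumes the directory names given in the steps above.

```python
from pathlib import Path

# Paths taken from the setup steps above; the helper itself is hypothetical.
REQUIRED_DIRS = [
    "checkpoints/stable-diffusion-v1-5",
    "runs/xpsr",
    "testset",
]


def missing_dirs(root=".", required=REQUIRED_DIRS):
    """Return the required directories that are absent or empty under root."""
    missing = []
    for rel in required:
        path = Path(root) / rel
        # A directory counts as ready only if it exists and holds at least one entry.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_dirs():
        print(f"missing or empty: {rel}")
```

Run from the repository root, this prints nothing once every directory exists and is non-empty.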
We use llava-v1.5-7b from Hugging Face to generate high-level prompts.
In addition, to improve the model's perception of low-level factors, we use an MLLM fine-tuned with Q-Instruct to generate low-level prompts.
You can also download the two MLLMs in advance and place them into `checkpoints/`.
./utils_data/highlevel_prompt_test.sh
./utils_data/lowlevel_prompt_test.sh
python test.py
You can modify the parameters in `configs/xpsr_test.yaml` to suit your needs, such as `guidance_scale` and `num_inference_steps`.
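If you prefer to try several settings without editing the file by hand, one option is to load the YAML into a dictionary (for example with PyYAML's `yaml.safe_load`) and override values in code. The sketch below shows only the override step on a plain dictionary. The function `override`, the nesting of the example dictionary, and all numeric values are assumptions for illustration; the real `configs/xpsr_test.yaml` may be laid out differently.

```python
def override(config, **overrides):
    """Return a copy of a nested config dict with matching keys replaced at any depth."""
    updated = {}
    for key, value in config.items():
        if isinstance(value, dict):
            # Recurse so that keys nested under sections are also reached.
            updated[key] = override(value, **overrides)
        elif key in overrides:
            updated[key] = overrides[key]
        else:
            updated[key] = value
    return updated


# Hypothetical layout and placeholder values, not the repository defaults.
example = {"test": {"guidance_scale": 1.0, "num_inference_steps": 10}, "seed": 0}
tuned = override(example, guidance_scale=2.0, num_inference_steps=20)
print(tuned)
```

The original dictionary is left untouched, so several variants can be derived from one loaded config.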
- Download the SD-v1.5 models from Hugging Face and put them into `checkpoints/stable-diffusion-v1-5`.
- Prepare training images in `gt_path/`.
We generate training data based on the degradation pipelines of Real-ESRGAN.
The generated images will be saved in `trainset/`.
./utils_data/make_train.sh
Using the same approach, the corresponding degraded images can be generated for the test set.
./utils_data/make_valid.sh
You can modify the arguments in `make_valid.py` to suit your needs.
Due to the significant overhead of MLLMs, we generate the prompts for the training data in advance so that they can be loaded directly during training.
./utils_data/highlevel_prompt_train.sh
./utils_data/lowlevel_prompt_train.sh
This process is time-consuming, so we suggest running the generation on as many GPUs as possible.
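One simple way to spread prompt generation over several GPUs is to split the image list into disjoint shards and process each shard in its own process. The sketch below illustrates only the splitting step. The function `shard` and the file names are hypothetical, and the provided scripts may already distribute the work differently.

```python
def shard(items, num_shards):
    """Split items into num_shards round-robin slices, one per worker."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return [items[i::num_shards] for i in range(num_shards)]


# Hypothetical file names; in practice these would be the images under trainset/.
images = [f"img_{i:04d}.png" for i in range(10)]
for gpu_id, chunk in enumerate(shard(images, 4)):
    # Each chunk could be handed to a separate process pinned to one GPU,
    # for example by setting CUDA_VISIBLE_DEVICES for that process.
    print(gpu_id, chunk)
```

Round-robin slicing keeps shard sizes within one item of each other, so no GPU is left with a noticeably longer queue.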
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" accelerate launch train.py
You can modify the parameters in `configs/xpsr.yaml` to suit your needs, such as `train_batch_size` and `learning_rate`.
If our work is useful for your research, please consider citing:
@article{qu2024xpsr,
title={XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution},
author={Qu, Yunpeng and Yuan, Kun and Zhao, Kai and Xie, Qizhi and Hao, Jinhua and Sun, Ming and Zhou, Chao},
journal={arXiv preprint arXiv:2403.05049},
year={2024}
}
The code was reorganized in a hurry, so there may be many problems.
Please feel free to contact: [email protected].
I am happy to discuss any issues with you and will maintain this repository in my free time.
Some code is borrowed from PASD and SeeSR. Thanks for their excellent work.