DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting


Yongming Rao*,1 Wenliang Zhao*,1 Guangyi Chen2,3 Yansong Tang4 Zheng Zhu1

Guan Huang5 Jie Zhou1 Jiwen Lu1

1Tsinghua University 2MBZUAI 3CMU 4University of Oxford 5PhiGent Robotics

[Paper (arXiv)] [Code (GitHub)]


Figure 1: Unlike the conventional "pre-training + fine-tuning" paradigm, our proposed DenseCLIP transfers the knowledge learned with image-text contrastive learning to dense prediction models via a new pixel-text matching task, and further uses the contextual information from images to prompt the pre-trained language model.

Figure 2: Results of different pre-training and fine-tuning strategies on the semantic segmentation task. We report the single-scale and multi-scale mIoU on ADE20K for different pre-trained ResNet-50 models, including supervised ImageNet1K (IN1K) and ImageNet21K (IN21K) pre-training, self-supervised MoCoV2 and DenseCL, and the vision-language model CLIP. Equipped with DenseCLIP, we show that large-scale vision-language pre-training can substantially improve dense prediction performance (+4.9%/+4.1%) over the commonly used ImageNet pre-training.

Abstract

Recent progress has shown that large-scale pre-training using contrastive image-text pairs can be a promising alternative for high-quality visual representation learning from natural language supervision. Benefiting from a broader source of supervision, this new paradigm exhibits impressive transferability to downstream classification tasks and datasets. However, the problem of transferring the knowledge learned from image-text pairs to more complex dense prediction tasks has barely been visited. In this work, we present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP. Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models. By further using the contextual information from the image to prompt the language model, we enable our model to better exploit the pre-trained knowledge. Our method is model-agnostic: it can be applied to arbitrary dense prediction systems and various pre-trained visual backbones, including both CLIP models and ImageNet pre-trained models. Extensive experiments demonstrate the superior performance of our method on semantic segmentation, object detection, and instance segmentation tasks.

Approach

Figure 3: The overall framework of DenseCLIP. DenseCLIP first extracts the image embeddings and $K$-class text embeddings, and then calculates pixel-text score maps to convert the original image-text matching problem in CLIP to pixel-text matching for dense prediction. These score maps are fed into the decoder and are also supervised using the ground-truth labels. To better exploit the pre-trained knowledge, DenseCLIP uses the contextual information in images to prompt the language model with a Transformer module.
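To make the pixel-text matching concrete, below is a minimal PyTorch sketch of how the score maps described above could be computed. This is not the official implementation; the tensor shapes, function name, and temperature value are assumptions for illustration.

import torch
import torch.nn.functional as F

def pixel_text_score_maps(pixel_emb, text_emb, tau=0.07):
    # pixel_emb: [B, C, H, W] dense image embeddings from the image encoder
    # text_emb:  [K, C] class embeddings from the text encoder
    pixel_emb = F.normalize(pixel_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=-1)
    # cosine similarity between every pixel and every class, scaled by a temperature
    score_maps = torch.einsum("bchw,kc->bkhw", pixel_emb, text_emb) / tau
    return score_maps  # [B, K, H, W]

The score maps can then be concatenated to the visual features fed to the decoder and, after upsampling to the label resolution, supervised with a standard cross-entropy segmentation loss.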


Results

  • Our method can serve as a plug-and-play module that improves the fine-tuning of CLIP pre-trained models across off-the-shelf dense prediction methods and tasks.

  • Our framework can also be applied to any backbone model by using the pre-trained language model to guide training on dense prediction tasks (see the sketch below).
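As a rough illustration of this plug-and-play use, the sketch below wraps an arbitrary backbone and an off-the-shelf decode head with the language guidance. The module and attribute names (e.g. backbone.out_channels, decode_head) are hypothetical, and the frozen CLIP text encoder is assumed to supply one embedding per class name.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedSegmentor(nn.Module):
    def __init__(self, backbone, decode_head, text_embeddings, embed_dim=512, tau=0.07):
        super().__init__()
        self.backbone = backbone            # any visual backbone, e.g. ResNet or Swin
        self.decode_head = decode_head      # any off-the-shelf segmentation decoder
        # frozen class embeddings from the CLIP text encoder, shape [K, D]
        self.register_buffer("text_embeddings", text_embeddings)
        self.proj = nn.Conv2d(backbone.out_channels, embed_dim, kernel_size=1)
        self.tau = tau

    def forward(self, images):
        feats = self.backbone(images)       # [B, C, H, W] last-stage features
        pix = F.normalize(self.proj(feats), dim=1)
        txt = F.normalize(self.text_embeddings, dim=-1)
        score = torch.einsum("bchw,kc->bkhw", pix, txt) / self.tau
        # the score maps act both as an auxiliary prediction and as extra input channels
        logits = self.decode_head(torch.cat([feats, score], dim=1))
        return logits, score

During training, the main segmentation loss on the decoder output would be combined with an auxiliary cross-entropy loss on the upsampled score maps.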

Table 1: Semantic segmentation results on ADE20K. We compare the performance of DenseCLIP and existing methods when using the same backbone. We report the mIoU of both single-scale and multi-scale testing, the FLOPs and the number of parameters. The FLOPs are measured with 1024×1024 input using the fvcore library. The results show that our DenseCLIP outperforms other methods by large margins with much lower complexity. Our models and our baselines that are trained using identical settings are highlighted in gray.


Table 2: Ablation study. We demonstrate that performing post-model vision-to-language prompting yields better performance with fewer extra FLOPs and parameters.
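For reference, below is a minimal sketch of what post-model vision-to-language prompting could look like: the class (text) embeddings query the image context through a small Transformer decoder and receive a residual update. The layer sizes and the residual scale are assumptions, not the exact configuration used in the paper.

import torch
import torch.nn as nn

class PostModelPrompting(nn.Module):
    def __init__(self, dim=512, num_layers=3, num_heads=4):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.gamma = nn.Parameter(torch.full((dim,), 1e-4))  # small residual scale

    def forward(self, text_emb, visual_ctx):
        # text_emb: [B, K, D] class embeddings; visual_ctx: [B, N, D] image context tokens
        update = self.decoder(tgt=text_emb, memory=visual_ctx)
        return text_emb + self.gamma * update

Because the refinement happens after the text encoder, the text embeddings only need to be computed once per class set, which helps explain the lower extra FLOPs and parameter counts reported in Table 2.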


Table 3: Object detection on COCO val2017 using RetinaNet. We compare our DenseCLIP framework to the vanilla fine-tuning of ImageNet/CLIP pre-trained models. We find that DenseCLIP makes better use of the language priors and thus facilitates training.

Table 4: Applying DenseCLIP to any backbone. Image backbones (such as ImageNet pre-trained ResNet and Swin Transformer) equipped with our DenseCLIP benefit from the language priors and enjoy a significant performance boost. We report mIoU on the ADE20K dataset for both single-scale (SS) and multi-scale (MS) testing.


Figure 4: Qualitative results on ADE20K. We visualize the segmentation results on the ADE20K validation set of our DenseCLIP based on ResNet-101 and two baseline models.


BibTeX

@article{rao2021denseclip,
  title={DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting},
  author={Rao, Yongming and Zhao, Wenliang and Chen, Guangyi and Tang, Yansong and Zhu, Zheng and Huang, Guan and Zhou, Jie and Lu, Jiwen},
  journal={arXiv preprint arXiv:2112.01518},
  year={2021}
}