WeakTr

Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation

Lianghui Zhu¹ *, Yingyue Li¹ *, Jiemin Fang¹, Xinggang Wang¹ 📧, Yan Liu², Hao Xin², Wenyu Liu¹

¹ School of EIC, Huazhong University of Science & Technology, ² Ant Group

(*) equal contribution, (📧) corresponding author.

ArXiv Preprint (arXiv 2304.01184)

Highlight

The proposed WeakTr fully explores the potential of plain ViT for WSSS. It achieves state-of-the-art results on two challenging WSSS benchmarks, with 74.0% mIoU on the PASCAL VOC 2012 validation set and 46.9% mIoU on the COCO 2014 validation set, significantly surpassing previous methods.
When built on the improved ViT, which is pretrained on ImageNet-21k and fine-tuned on ImageNet-1k, WeakTr performs better still, reaching 78.4% mIoU on the PASCAL VOC 2012 validation set and 50.3% mIoU on the COCO 2014 validation set.

Introduction

This paper explores the properties of the plain Vision Transformer (ViT) for Weakly-supervised Semantic Segmentation (WSSS). The class activation map (CAM) is critically important both for understanding a classification network and for launching WSSS. We observe that different attention heads of ViT focus on different image areas. We therefore propose a novel weight-based method that estimates the importance of each attention head end to end and adaptively fuses the self-attention maps, yielding high-quality CAM results that tend to cover objects more completely.
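The idea of weighting attention heads before fusing them can be illustrated with a small sketch. This is not the code from this repository: the function names, the softmax normalization of the head scores, and the flat-list representation of attention maps are all assumptions made for illustration, and plain Python is used so the snippet runs without dependencies.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_attention_heads(head_maps, head_scores):
    """Fuse per-head attention maps with a weighted sum.

    head_maps:   list of H maps, each a flat list of N patch activations.
    head_scores: list of H unnormalized importance scores (hypothetical;
                 in a trained model these would be predicted by the network).
    Returns (fused_map, weights).
    """
    if not head_maps or len(head_maps) != len(head_scores):
        raise ValueError("need one score per attention head")
    weights = softmax(head_scores)  # assumption: weights are normalized to sum to 1
    fused = [0.0] * len(head_maps[0])
    for weight, head_map in zip(weights, head_maps):
        for i, value in enumerate(head_map):
            fused[i] += weight * value
    return fused, weights


# Two heads looking at different patches, equally important.
maps = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 1.0, 0.0]]
fused, weights = fuse_attention_heads(maps, [0.0, 0.0])
print(fused)
```

With equal scores, every head contributes equally and the fused map covers the union of the regions the heads attend to. Raising one head's score shifts the fused map toward that head's region.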

Step1: End-to-End CAM Generation

In addition, we propose a ViT-based gradient clipping decoder that is retrained online with the CAM results to complete the WSSS task. We name this plain Transformer-based weakly-supervised learning framework WeakTr. It achieves state-of-the-art WSSS performance on standard benchmarks, i.e., 78.4% mIoU on the val set of PASCAL VOC 2012 and 50.3% mIoU on the val set of COCO 2014.
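This page does not spell out how the decoder clips gradients. The sketch below shows only one plausible reading, offered as an assumption rather than as the authors' method: pixels whose loss exceeds a threshold (here the mean loss) are treated as likely pseudo-label noise and excluded from the averaged loss, so they would contribute no gradient. The function name and the choice of threshold are hypothetical.

```python
def clipped_mean_loss(pixel_losses, threshold=None):
    """Average per-pixel losses, ignoring pixels above a threshold.

    pixel_losses: flat list of non-negative per-pixel loss values.
    threshold:    losses above this value are dropped; defaults to the
                  mean loss (an assumption made for this sketch).
    Returns (loss, mask), where mask[i] is 1 if pixel i was kept.
    """
    if not pixel_losses:
        raise ValueError("pixel_losses must not be empty")
    if threshold is None:
        threshold = sum(pixel_losses) / len(pixel_losses)
    mask = [1 if loss <= threshold else 0 for loss in pixel_losses]
    kept = [loss for loss, keep in zip(pixel_losses, mask) if keep]
    if not kept:
        # Every pixel exceeded the threshold: nothing to learn from.
        return 0.0, mask
    return sum(kept) / len(kept), mask


# The last pixel has a very large loss, e.g. from a wrong pseudo-label.
loss, mask = clipped_mean_loss([0.1, 0.2, 0.3, 3.4])
print(mask)
```

In an autograd framework the same effect is typically obtained by multiplying the per-pixel loss map by the mask before reduction, which zeroes the gradient at the masked positions.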

Step2: Online Retraining with Gradient Clipping Decoder

Getting Started

Main results

Step1: End-to-End CAM Generation

Dataset	Method	Checkpoint	CAM_Label	Train mIoU
Pascal VOC 2012	WeakTr	Google Drive	Google Drive	69.3%
COCO 2014	WeakTr	Google Drive	Google Drive	41.9%
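The CAM_Label column refers to pseudo-labels derived from CAMs. A common way to obtain such labels in WSSS, sketched here under assumptions and not taken from this repository, is to assign each pixel the class with the strongest activation and fall back to background when no activation is strong enough. The function name, the dictionary layout, and the 0.5 background threshold are hypothetical.

```python
def cam_to_pseudo_label(cams, bg_threshold=0.5):
    """Convert per-class CAMs into a per-pixel pseudo-label.

    cams:         dict mapping a foreground class id (>= 1) to a flat list of
                  activations normalized to [0, 1]; only classes present in
                  the image-level label should be included.
    bg_threshold: hypothetical cutoff; pixels whose best activation is below
                  it are labeled 0 (background).
    """
    if not cams:
        raise ValueError("cams must contain at least one class")
    class_ids = sorted(cams)
    num_pixels = len(cams[class_ids[0]])
    labels = []
    for i in range(num_pixels):
        # Class with the highest activation at pixel i.
        best = max(class_ids, key=lambda c: cams[c][i])
        labels.append(best if cams[best][i] >= bg_threshold else 0)
    return labels


cams = {1: [0.9, 0.2, 0.1],
        2: [0.3, 0.7, 0.4]}
print(cam_to_pseudo_label(cams))
```

The Train mIoU column then measures how well such pseudo-labels agree with the ground-truth masks of the training set.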

Step2: Online Retraining with Gradient Clipping Decoder

Dataset	Method	Checkpoint	Val mIoU	Pseudo-mask	Train mIoU
Pascal VOC 2012	WeakTr	Google Drive	74.0%	Google Drive	76.5%
Pascal VOC 2012	WeakTr (improved ViT)	Google Drive	78.4%	Google Drive	80.3%
COCO 2014	WeakTr	Google Drive	46.9%	Google Drive	48.9%
COCO 2014	WeakTr (improved ViT)	Google Drive	50.3%	Google Drive	51.3%
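All results above are reported as mean Intersection over Union (mIoU). As a reminder of how the metric is defined, here is a minimal sketch that computes it from a confusion matrix. It is not the repository's `evaluation.py`; the function name, the `ignore_index` value of 255, and the choice to skip classes that never appear are assumptions.

```python
def mean_iou(preds, targets, num_classes, ignore_index=255):
    """Compute mIoU from flat lists of predicted and ground-truth class ids.

    Pixels whose target equals ignore_index are skipped. Classes that appear
    in neither predictions nor targets are left out of the mean (assumption).
    """
    # confusion[t][p] counts pixels of true class t predicted as class p.
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for pred, target in zip(preds, targets):
        if target == ignore_index:
            continue
        confusion[target][pred] += 1

    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(confusion[r][c] for r in range(num_classes)) - true_pos
        union = true_pos + false_pos + false_neg
        if union > 0:
            ious.append(true_pos / union)
    return sum(ious) / len(ious) if ious else 0.0


# Class 0 IoU = 1/2, class 1 IoU = 2/3, so mIoU = 7/12.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

For a benchmark score, the confusion matrix is normally accumulated over the whole dataset before the per-class IoU values are computed, rather than averaged image by image.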

Citation

If you find this repository or work helpful in your research, please consider citing the paper and giving the repository a ⭐.

@article{zhu2023weaktr,
      title={WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation}, 
      author={Lianghui Zhu and Yingyue Li and Jiemin Fang and Yan Liu and Hao Xin and Wenyu Liu and Xinggang Wang},
      year={2023},
      journal={arxiv:2304.01184},
}
