Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. Foundation Model for Monocular Depth Estimation

silencht/Depth-Anything

 
 

Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

Lihe Yang1 · Bingyi Kang2+ · Zilong Huang2 · Xiaogang Xu3,4 · Jiashi Feng2 · Hengshuang Zhao1+

1The University of Hong Kong · 2TikTok · 3Zhejiang Lab · 4Zhejiang University

+corresponding authors

Paper PDF Project Page

Depth Anything Pipeline

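Personal notes

First, a teacher model T is trained on the labeled datasets (the solid-line part of the pipeline figure).

A student model S (the dashed-line part) trained directly on the teacher's output is unlikely to surpass the teacher, so during student training the unlabeled images are perturbed for data augmentation (color and spatial distortions, Gaussian blur, cropping, mixing, and so on).

Semantic information is used as auxiliary supervision. The authors first ran a set of segmentation networks to assign labels to the unlabeled images, but the gain was limited; they infer that the discrete label space loses too much semantic information. They therefore use the DINOv2 model to extract semantic features directly (instead of simple labels), which makes the semantic signal continuous and rich. Training then minimizes a cosine-similarity loss between the feature f_i extracted by S and the feature f_i' extracted by DINOv2, i.e., it pulls the two features together.

However, semantic information only says that a region belongs to one object; it does not say that every part of the object has the same depth. So a threshold alpha is set: once the cosine similarity exceeds alpha, this loss is no longer applied to that pixel (which is then supervised only by the pseudo labels from teacher T).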
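The thresholded feature loss described above can be sketched in a few lines. This is only an illustration on plain Python lists, not the authors' implementation: the function names are hypothetical, each "pixel" is a short feature vector, and averaging `1 - cos` over the remaining pixels is an assumption about how the masked loss is reduced.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def feature_alignment_loss(student_feats, frozen_feats, alpha=0.15):
    """Mean of (1 - cos) over pixels whose similarity is at or below alpha.

    Pixels whose similarity already exceeds alpha are skipped, so they are
    left to the depth supervision alone (hypothetical reduction, see lead-in).
    """
    terms = []
    for f_student, f_frozen in zip(student_feats, frozen_feats):
        cos = cosine_similarity(f_student, f_frozen)
        if cos <= alpha:  # not aligned enough yet: keep this pixel in the loss
            terms.append(1.0 - cos)
    return sum(terms) / len(terms) if terms else 0.0


# Two pixels: the first is already aligned (skipped), the second is orthogonal.
student = [[1.0, 0.0], [1.0, 0.0]]
frozen = [[1.0, 0.0], [0.0, 1.0]]
print(feature_alignment_loss(student, frozen))
```

In a real training loop the same logic would run on feature tensors; the list version only makes the masking rule explicit.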

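So there are three losses in the end: one for the teacher model, one for the student model (the so-called affine-invariant loss; in effect, distortions are added to the input image and the difference between the predictions of S and T is reduced), and one for the semantic assistance.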
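To make "affine-invariant" concrete, here is a minimal sketch. It assumes the MiDaS-style normalization (subtract the median, divide by the mean absolute deviation) before taking a mean absolute difference; that choice and the function names are assumptions for illustration, not details stated in these notes. The inputs must not be constant, otherwise the scale is zero.

```python
from statistics import median


def normalize_depth(values):
    """Remove shift (median) and scale (mean absolute deviation) from a depth list."""
    shift = median(values)
    scale = sum(abs(v - shift) for v in values) / len(values)
    return [(v - shift) / scale for v in values]


def affine_invariant_loss(pred, target):
    """Mean absolute difference between the normalized prediction and target."""
    pred_n = normalize_depth(pred)
    target_n = normalize_depth(target)
    return sum(abs(p - t) for p, t in zip(pred_n, target_n)) / len(pred_n)


pred = [1.0, 2.0, 4.0, 8.0]
rescaled = [2.0 * v + 3.0 for v in pred]       # same depth up to scale and shift
print(affine_invariant_loss(pred, rescaled))   # numerically close to zero
print(affine_invariant_loss(pred, pred[::-1])) # clearly positive
```

Because the loss ignores a positive scale and a shift, it can compare predictions against labels from datasets with different depth units.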

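Implementation details

DINOv2 is used as the image encoder and DPT as the decoder for depth regression.

Stage 1: train the teacher model T for 20 epochs.

Stage 2: sweep through all unlabeled images once; they are labeled by the teacher model T (based on a ViT-L encoder).

In each batch, the ratio of labeled to unlabeled images is 1:2.

In all stages, the learning rate of the pretrained encoder is set to 5e-6 and the decoder uses a 10× larger learning rate. The optimizer is AdamW, and the learning rate is decayed with a linear schedule.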
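The two learning rates and the linear decay can be written out as plain arithmetic. This is a sketch under assumptions: the notes only say "linear scheduler", so decaying to zero at the final step is a guess, and the constant and function names are hypothetical. In PyTorch the same idea would typically be two parameter groups passed to `torch.optim.AdamW`.

```python
ENCODER_BASE_LR = 5e-6                  # pretrained DINOv2 encoder
DECODER_BASE_LR = 10 * ENCODER_BASE_LR  # DPT decoder, 10x larger


def linear_decay(base_lr, step, total_steps):
    """Linearly decay base_lr to zero over total_steps (assumed schedule shape)."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining


def learning_rates(step, total_steps):
    """Return the (encoder, decoder) learning rates at a given step."""
    return (
        linear_decay(ENCODER_BASE_LR, step, total_steps),
        linear_decay(DECODER_BASE_LR, step, total_steps),
    )


for step in (0, 500, 1000):
    print(step, learning_rates(step, total_steps=1000))
```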

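Labeled images are augmented with horizontal flipping only.

The threshold alpha is set to 0.15.

During training, all images are resized to 518×518. This is not done at inference, where each side only needs to be a multiple of 14 (the predefined patch size of the DINOv2 encoder).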
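The "multiple of 14" constraint amounts to snapping each side to the patch grid. The helper below is a simplified sketch with hypothetical names; it rounds to the nearest multiple and does not reproduce the repository's `Resize` transform, which also handles aspect ratio and a lower bound.

```python
PATCH_SIZE = 14  # patch size of the DINOv2 encoder


def round_to_patch_multiple(side, patch=PATCH_SIZE):
    """Round one image side to the nearest multiple of patch (at least one patch)."""
    return max(patch, round(side / patch) * patch)


def inference_size(height, width):
    """Return a (height, width) pair the encoder can split into whole patches."""
    return round_to_patch_multiple(height), round_to_patch_multiple(width)


print(inference_size(518, 518))   # (518, 518), since 518 = 37 * 14
print(inference_size(720, 1280))  # (714, 1274)
```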

This work presents Depth Anything, a highly practical solution for robust monocular depth estimation by training on a combination of 1.5M labeled images and 62M+ unlabeled images.

Features of Depth Anything

  • Relative depth estimation

    Our foundation models listed here can provide relative depth estimation for any given image robustly. Please refer here for details.

  • Metric depth estimation

    We fine-tune our Depth Anything model with metric depth information from NYUv2 or KITTI. It offers strong capabilities of both in-domain and zero-shot metric depth estimation. Please refer here for details.

  • Better depth-conditioned ControlNet

    We re-train a better depth-conditioned ControlNet based on Depth Anything. It offers more precise synthesis than the previous MiDaS-based ControlNet. Please refer here for details. You can also use our new ControlNet based on Depth Anything in ControlNet WebUI or ComfyUI's ControlNet.

  • Downstream high-level scene understanding

    The Depth Anything encoder can be fine-tuned for downstream high-level perception tasks, e.g., semantic segmentation, where it achieves 86.2 mIoU on Cityscapes and 59.4 mIoU on ADE20K. Please refer here for details.

Performance

Here we compare our Depth Anything with the previously best MiDaS v3.1 BEiTL-512 model.

Please note that the latest MiDaS is also trained on KITTI and NYUv2, while ours is not.

| Method | Params | KITTI AbsRel | KITTI $\delta_1$ | NYUv2 AbsRel | NYUv2 $\delta_1$ | Sintel AbsRel | Sintel $\delta_1$ | DDAD AbsRel | DDAD $\delta_1$ | ETH3D AbsRel | ETH3D $\delta_1$ | DIODE AbsRel | DIODE $\delta_1$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MiDaS | 345.0M | 0.127 | 0.850 | 0.048 | 0.980 | 0.587 | 0.699 | 0.251 | 0.766 | 0.139 | 0.867 | 0.075 | 0.942 |
| Ours-S | 24.8M | 0.080 | 0.936 | 0.053 | 0.972 | 0.464 | 0.739 | 0.247 | 0.768 | 0.127 | 0.885 | 0.076 | 0.939 |
| Ours-B | 97.5M | 0.080 | 0.939 | 0.046 | 0.979 | 0.432 | 0.756 | 0.232 | 0.786 | 0.126 | 0.884 | 0.069 | 0.946 |
| Ours-L | 335.3M | 0.076 | 0.947 | 0.043 | 0.981 | 0.458 | 0.760 | 0.230 | 0.789 | 0.127 | 0.882 | 0.066 | 0.952 |

We highlight the best and second best results in bold and italic respectively (better results: AbsRel $\downarrow$, $\delta_1 \uparrow$).
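For readers unfamiliar with the two metrics, the sketch below uses their standard definitions: AbsRel is the mean of `|pred - gt| / gt`, and $\delta_1$ is the fraction of pixels with `max(pred/gt, gt/pred) < 1.25`. It works on flat lists of positive depths and is only illustrative; it leaves out any scale-and-shift alignment an evaluation protocol may apply to relative depth before scoring.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt). Lower is better."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) < threshold. Higher is better."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)


gt = [1.0, 2.0, 4.0]
pred = [1.1, 2.0, 6.0]  # the last pixel is off by 50%
print(abs_rel(pred, gt))
print(delta1(pred, gt))
```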

Pre-trained models

We provide three models of varying scales for robust relative depth estimation:

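| Model | Params | Inference time on V100 (ms) | Inference time on A100 (ms) | Inference time on RTX4090 with TensorRT (ms) |
|---|---|---|---|---|
| Depth-Anything-Small | 24.8M | 12 | 8 | 3 |
| Depth-Anything-Base | 97.5M | 13 | 9 | 6 |
| Depth-Anything-Large | 335.3M | 20 | 13 | 12 |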

Note that the V100 and A100 inference times (without TensorRT) exclude the pre-processing and post-processing stages, whereas the last column, RTX4090 (with TensorRT), includes these two stages (please refer to Depth-Anything-TensorRT).

You can easily load our pre-trained models by:

from depth_anything.dpt import DepthAnything

encoder = 'vits' # can also be 'vitb' or 'vitl'
depth_anything = DepthAnything.from_pretrained('LiheYoung/depth_anything_{:}14'.format(encoder))

Depth Anything is also supported in transformers. You can use it for depth prediction within 3 lines of code (credit to @niels).

No network connection, or cannot load these models? Load them from a local folder instead:
# suppose the config and checkpoint files are stored under the folder checkpoints/depth_anything_vitb14
depth_anything = DepthAnything.from_pretrained('checkpoints/depth_anything_vitb14', local_files_only=True)

Usage

Installation

git clone https://github.com/LiheYoung/Depth-Anything
cd Depth-Anything
pip install -r requirements.txt

Running

python run.py --encoder <vits | vitb | vitl> --img-path <img-directory | single-img | txt-file> --outdir <outdir>

For the img-path, you can either 1) point it to a directory containing all images of interest, 2) point it to a single image, or 3) point it to a text file listing all image paths.

For example:

python run.py --encoder vitl --img-path assets/examples --outdir depth_vis
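The three accepted forms of `--img-path` can be resolved into one list of files along the following lines. This is a hypothetical helper for illustration, not a copy of the logic in run.py: the function name, the one-path-per-line format of the text file, and the skipping of hidden files are assumptions.

```python
import os


def collect_image_paths(img_path):
    """Expand an --img-path value into a list of image file paths."""
    if os.path.isfile(img_path):
        if img_path.endswith('.txt'):
            # Text file: one image path per line, blank lines ignored.
            with open(img_path, 'r') as f:
                return [line.strip() for line in f if line.strip()]
        # Single image file.
        return [img_path]
    # Directory: every non-hidden entry, in a stable sorted order.
    names = sorted(n for n in os.listdir(img_path) if not n.startswith('.'))
    return [os.path.join(img_path, n) for n in names]
```

For example, `collect_image_paths('assets/examples')` would return the sorted files in that directory, provided it exists.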

If you want to use Depth Anything on videos:

python run_video.py --encoder vitl --video-path assets/examples_video --outdir video_depth_vis

Gradio demo

To use our gradio demo locally:

python app.py

You can also try our online demo.

Import Depth Anything to your project

If you want to use Depth Anything in your own project, you can simply follow run.py to load our models and define data pre-processing.

Code snippet (note the difference between our data pre-processing and that of MiDaS)
from depth_anything.dpt import DepthAnything
from depth_anything.util.transform import Resize, NormalizeImage, PrepareForNet

import cv2
import torch
from torchvision.transforms import Compose  # needed for the transform pipeline below

encoder = 'vits' # can also be 'vitb' or 'vitl'
depth_anything = DepthAnything.from_pretrained('LiheYoung/depth_anything_{:}14'.format(encoder)).eval()

transform = Compose([
    Resize(
        width=518,
        height=518,
        resize_target=False,
        keep_aspect_ratio=True,
        ensure_multiple_of=14,
        resize_method='lower_bound',
        image_interpolation_method=cv2.INTER_CUBIC,
    ),
    NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    PrepareForNet(),
])

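# read the image as RGB and scale pixel values to [0, 1]
image = cv2.cvtColor(cv2.imread('your image path'), cv2.COLOR_BGR2RGB) / 255.0
image = transform({'image': image})['image']
image = torch.from_numpy(image).unsqueeze(0)  # add a batch dimension

# depth shape: 1xHxW
with torch.no_grad():  # inference only, no gradients needed
    depth = depth_anything(image)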
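The predicted relative depth has no fixed numeric range, so it is usually rescaled before being saved or displayed. One common approach is min-max normalization to 8-bit values, sketched here on a plain list of rows so that it runs without extra dependencies; the function name is hypothetical, and with NumPy or torch arrays the arithmetic is the same.

```python
def to_uint8(depth_rows):
    """Min-max normalize a 2D depth map (list of rows) to integers in 0..255."""
    flat = [v for row in depth_rows for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Constant map: avoid dividing by zero and return all zeros.
        return [[0 for _ in row] for row in depth_rows]
    scale = 255.0 / (hi - lo)
    return [[int(round((v - lo) * scale)) for v in row] for row in depth_rows]


print(to_uint8([[2.0, 4.0], [6.0, 12.0]]))  # [[0, 51], [102, 255]]
```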

Do not want to define image pre-processing or download model definition files?

Easily use Depth Anything through transformers within 3 lines of code! Please refer to these instructions (credit to @niels).

A brief demo:
from transformers import pipeline
from PIL import Image

image = Image.open('Your-image-path')
pipe = pipeline(task="depth-estimation", model="LiheYoung/depth-anything-small-hf")
depth = pipe(image)["depth"]

Community Support

We sincerely appreciate all the extensions the community has built on Depth Anything. Thank you very much!

Here we list the extensions we have found:

If you have a project that supports or improves (e.g., speeds up) Depth Anything, please feel free to open an issue. We will add it here.

Acknowledgement

We would like to express our deepest gratitude to AK (@_akhaliq) and the awesome HuggingFace team (@niels, @hysts, and @yuvraj) for helping improve the online demo and build the HF models.

Besides, we thank the MagicEdit team for providing some video examples for video depth estimation, and Tiancheng Shen for evaluating the depth maps with MagicEdit.

Citation

If you find this project useful, please consider citing:

@article{depthanything,
      title={Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data}, 
      author={Yang, Lihe and Kang, Bingyi and Huang, Zilong and Xu, Xiaogang and Feng, Jiashi and Zhao, Hengshuang},
      journal={arXiv:2401.10891},
      year={2024}
}
