Add video and multimodal works in README (mli#208)
bryanyzhu authored Oct 10, 2022
1 parent 5566c4f commit d950a46
Showing 1 changed file with 37 additions and 1 deletion.
README.md — 38 changes: 37 additions & 1 deletion
@@ -77,7 +77,6 @@
| Recorded | Year | Title | Description | Citations |
| ------ | ---- | ------------------------------------------------------------ | -------------------- | ------------------------------------------------------------ |
|| 2020 | [ViT](https://arxiv.org/pdf/2010.11929.pdf) | Transformers storm into computer vision |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2F7b15fa1b8d413fbe14ef7a97f651f47f5aff3903%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/An-Image-is-Worth-16x16-Words%3A-Transformers-for-at-Dosovitskiy-Beyer/7b15fa1b8d413fbe14ef7a97f651f47f5aff3903) |
|| 2021 | [CLIP](https://openai.com/blog/clip/) | Contrastive learning between images and text |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2F6f870f7f02a8c59c3e23f407f3ef00dd1dcf8fc4%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Learning-Transferable-Visual-Models-From-Natural-Radford-Kim/6f870f7f02a8c59c3e23f407f3ef00dd1dcf8fc4) |
|| 2021 | [Swin Transformer](https://arxiv.org/pdf/2103.14030.pdf) | A hierarchical Vision Transformer | [![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2Fc8b25fab5608c3e033d34b4483ec47e68ba109b7%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Swin-Transformer%3A-Hierarchical-Vision-Transformer-Liu-Lin/c8b25fab5608c3e033d34b4483ec47e68ba109b7) |
| | 2021 | [MLP-Mixer](https://arxiv.org/pdf/2105.01601.pdf) | Replaces self-attention with MLPs |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2F2def61f556f9a5576ace08911496b7c7e4f970a4%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/MLP-Mixer%3A-An-all-MLP-Architecture-for-Vision-Tolstikhin-Houlsby/2def61f556f9a5576ace08911496b7c7e4f970a4) |
|| 2021 | [MAE](https://arxiv.org/pdf/2111.06377.pdf) | The CV counterpart of BERT |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2Fc1962a8cf364595ed2838a097e9aa7cd159d3118%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Masked-Autoencoders-Are-Scalable-Vision-Learners-He-Chen/c1962a8cf364595ed2838a097e9aa7cd159d3118) |
@@ -98,6 +97,7 @@
| | 2021 | [Improved DDPM](https://arxiv.org/pdf/2102.09672.pdf) | An improved DDPM |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2Fde18baa4964804cf471d85a5a090498242d2e79f%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Improved-Denoising-Diffusion-Probabilistic-Models-Nichol-Dhariwal/de18baa4964804cf471d85a5a090498242d2e79f) |
| | 2021 | [Guided Diffusion Models](https://arxiv.org/pdf/2105.05233.pdf) | Claimed to surpass GANs |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2F64ea8f180d0682e6c18d1eb688afdb2027c02794%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Diffusion-Models-Beat-GANs-on-Image-Synthesis-Dhariwal-Nichol/64ea8f180d0682e6c18d1eb688afdb2027c02794) |
| | 2021 | [StyleGAN3](https://arxiv.org/pdf/2106.12423.pdf) | |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2Fc1ff08b59f00c44f34dfdde55cd53370733a2c19%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Alias-Free-Generative-Adversarial-Networks-Karras-Aittala/c1ff08b59f00c44f34dfdde55cd53370733a2c19) |
|| 2022 | [DALL.E 2](https://arxiv.org/pdf/2204.06125.pdf) | CLIP + diffusion models, a new high for text-to-image generation |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2Fc57293882b2561e1ba03017902df9fc2f289dea2%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Hierarchical-Text-Conditional-Image-Generation-with-Ramesh-Dhariwal/c57293882b2561e1ba03017902df9fc2f289dea2) |

### Computer Vision - Object Detection

@@ -136,6 +136,42 @@
|| 2021 | [DINO](https://arxiv.org/pdf/2104.14294.pdf) | Transformers plus self-supervised learning shine in vision too |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2Fad4a0938c48e61b7827869e4ac3baffd0aefab35%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Emerging-Properties-in-Self-Supervised-Vision-Caron-Touvron/ad4a0938c48e61b7827869e4ac3baffd0aefab35) |


### Computer Vision - Video Understanding

| Recorded | Year | Title | Description | Citations |
| ------ | ---- | ------------------------------------------------------------ | -------------------- | ------------------------------------------------------------ |
|| 2014 | [DeepVideo](https://cs.stanford.edu/people/karpathy/deepvideo/) | Introduces the Sports-1M dataset; deep learning for video understanding |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2F6d4c9c923e9f145d1c01a2de2afc38ec23c44253%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Large-Scale-Video-Classification-with-Convolutional-Karpathy-Toderici/6d4c9c923e9f145d1c01a2de2afc38ec23c44253) |
|| 2014 | [Two-stream](https://arxiv.org/pdf/1406.2199.pdf) | Introduces optical flow for temporal modeling; neural networks beat hand-crafted features for the first time |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2F67dccc9a856b60bdc4d058d83657a089b8ad4486%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Two-Stream-Convolutional-Networks-for-Action-in-Simonyan-Zisserman/67dccc9a856b60bdc4d058d83657a089b8ad4486) |
|| 2014 | [C3D](https://arxiv.org/pdf/1412.0767.pdf) | A fairly deep 3D CNN for video understanding |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2Fd25c65d261ea0e6a458be4c50c40ffe5bc508f77%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Learning-Spatiotemporal-Features-with-3D-Networks-Tran-Bourdev/d25c65d261ea0e6a458be4c50c40ffe5bc508f77) |
|| 2015 | [Beyond-short-snippets](https://arxiv.org/pdf/1503.08909.pdf) | Explores using LSTMs |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2F5418b2a482720e013d487a385c26fae0f017c6a6%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Beyond-short-snippets%3A-Deep-networks-for-video-Ng-Hausknecht/5418b2a482720e013d487a385c26fae0f017c6a6) |
|| 2016 | [Convolutional fusion](https://arxiv.org/pdf/1604.06573.pdf) | Early fusion to strengthen spatio-temporal modeling |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2F9d9aced120e530484609164c836da64548693484%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Convolutional-Two-Stream-Network-Fusion-for-Video-Feichtenhofer-Pinz/9d9aced120e530484609164c836da64548693484) |
|| 2016 | [TSN](https://arxiv.org/pdf/1608.00859.pdf) | Highly effective segment-level video modeling; a bag of tricks for video |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2Fea3d7de6c0880e14455b9acb28f1bc1234321456%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Temporal-Segment-Networks%3A-Towards-Good-Practices-Wang-Xiong/ea3d7de6c0880e14455b9acb28f1bc1234321456) |
|| 2017 | [I3D](https://arxiv.org/pdf/1705.07750.pdf) | Introduces the Kinetics dataset; inflates 2D networks to 3D, kicking off the 3D-CNN era |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2Fb61a3f8b80bbd44f24544dc915f52fd30bbdf485%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Quo-Vadis%2C-Action-Recognition-A-New-Model-and-the-Carreira-Zisserman/b61a3f8b80bbd44f24544dc915f52fd30bbdf485) |
|| 2017 | [R2+1D](https://arxiv.org/pdf/1711.11248.pdf) | Factorizes 3D convolutions, making 3D networks easier to optimize |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2F89c3050522a0bb9820c32dc7444e003ef0d3e2e4%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/A-Closer-Look-at-Spatiotemporal-Convolutions-for-Tran-Wang/89c3050522a0bb9820c32dc7444e003ef0d3e2e4) |
|| 2017 | [Non-local](https://arxiv.org/pdf/1711.07971.pdf) | Brings self-attention to vision problems |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2F8899094797e82c5c185a0893896320ef77f60e64%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Non-local-Neural-Networks-Wang-Girshick/8899094797e82c5c185a0893896320ef77f60e64) |
|| 2018 | [SlowFast](https://arxiv.org/pdf/1812.03982.pdf) | Fast and slow pathways improve efficiency |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2F8b47b9c3c35b2b2a78bff7822605b3040f87d699%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/SlowFast-Networks-for-Video-Recognition-Feichtenhofer-Fan/8b47b9c3c35b2b2a78bff7822605b3040f87d699) |
|| 2021 | [TimeSformer](https://arxiv.org/pdf/2102.05095.pdf) | First transformer for video, opening the video-transformer era |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2Fc143ea9e30b1f2d93a9c060253845423f9e60e1f%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Is-Space-Time-Attention-All-You-Need-for-Video-Bertasius-Wang/c143ea9e30b1f2d93a9c060253845423f9e60e1f) |


### Multimodal Learning


| Recorded | Year | Title | Description | Citations |
| ------ | ---- | ------------------------------------------------------------ | -------------------- | ------------------------------------------------------------ |
|| 2021 | [CLIP](https://openai.com/blog/clip/) | Contrastive learning between images and text |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2F6f870f7f02a8c59c3e23f407f3ef00dd1dcf8fc4%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Learning-Transferable-Visual-Models-From-Natural-Radford-Kim/6f870f7f02a8c59c3e23f407f3ef00dd1dcf8fc4) |
|| 2021 | [ViLT](https://arxiv.org/pdf/2102.03334.pdf) | First vision-language model without an object detector |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2F0839722fb5369c0abaff8515bfc08299efc790a1%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/ViLT%3A-Vision-and-Language-Transformer-Without-or-Kim-Son/0839722fb5369c0abaff8515bfc08299efc790a1) |
|| 2021 | [ViLD](https://arxiv.org/pdf/2104.13921.pdf) | CLIP distillation for open-vocabulary object detection |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2Fcf9b8da26d9b92e75ba49616ed2a1033f59fce14%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Open-vocabulary-Object-Detection-via-Vision-and-Gu-Lin/cf9b8da26d9b92e75ba49616ed2a1033f59fce14) |
|| 2021 | [GLIP](https://arxiv.org/pdf/2112.03857.pdf) | Unifies object detection and phrase grounding |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2F5341b412383c43f4a693ad63ec4489e3ec7688c8%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Grounded-Language-Image-Pre-training-Li-Zhang/5341b412383c43f4a693ad63ec4489e3ec7688c8) |
|| 2021 | [CLIP4Clip](https://arxiv.org/pdf/2104.08860.pdf) | Applies CLIP directly to video-text retrieval |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2F281ad83e06d731d5d686acf07cd701576f1188c4%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/CLIP4Clip%3A-An-Empirical-Study-of-CLIP-for-End-to-Luo-Ji/281ad83e06d731d5d686acf07cd701576f1188c4) |
|| 2021 | [ActionCLIP](https://arxiv.org/pdf/2109.08472.pdf) | Supervised video action classification with multimodal contrastive learning |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2Fdc05240a06326b5b1664f7e8c95c330b08cd0349%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/ActionCLIP%3A-A-New-Paradigm-for-Video-Action-Wang-Xing/dc05240a06326b5b1664f7e8c95c330b08cd0349) |
|| 2021 | [PointCLIP](https://arxiv.org/pdf/2112.02413.pdf) | Turns 3D into 2D; a clever use of CLIP for point clouds |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2Ff3ce9ba3fcec362b70263a7ed63d9404975496a0%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/PointCLIP%3A-Point-Cloud-Understanding-by-CLIP-Zhang-Guo/f3ce9ba3fcec362b70263a7ed63d9404975496a0) |
|| 2022 | [LSeg](https://arxiv.org/pdf/2201.03546.pdf) | Supervised open-vocabulary segmentation |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2Fcc9826c222ac1e81b4b374dd9e0df130f298b1e8%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Language-driven-Semantic-Segmentation-Li-Weinberger/cc9826c222ac1e81b4b374dd9e0df130f298b1e8) |
|| 2022 | [GroupViT](https://arxiv.org/pdf/2202.11094.pdf) | Unsupervised segmentation from image-text pairs alone |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2F0b5f27a5766c5d1394a6282ad94fec21d620bd6b%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/GroupViT%3A-Semantic-Segmentation-Emerges-from-Text-Xu-Mello/0b5f27a5766c5d1394a6282ad94fec21d620bd6b) |
|| 2022 | [CLIPasso](https://arxiv.org/pdf/2202.05822.pdf) | CLIP crosses over to sketch generation |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2F9dec819778bebae4a468c7813f7638534c826f52%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/CLIPasso%3A-Semantically-Aware-Object-Sketching-Vinker-Pajouheshgar/9dec819778bebae4a468c7813f7638534c826f52) |
|| 2022 | [DepthCLIP](https://arxiv.org/pdf/2207.01077.pdf) | Depth estimation driven by language |[![citation](https://img.shields.io/badge/dynamic/json?label=citation&query=citationCount&url=https%3A%2F%2Fapi.semanticscholar.org%2Fgraph%2Fv1%2Fpaper%2F9d0afe58801fe9e5537902e853d6e9e385340a92%3Ffields%3DcitationCount)](https://www.semanticscholar.org/paper/Can-Language-Understand-Depth-Zhang-Zeng/9d0afe58801fe9e5537902e853d6e9e385340a92) |
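
The citation column in the tables above is a live shields.io dynamic-JSON badge that reads the `citationCount` field from the Semantic Scholar Graph API. The sketch below shows how such a badge row can be assembled; the `citation_badge` helper and its argument names are illustrative and not part of this repository.

```python
from urllib.parse import quote

# Semantic Scholar Graph API endpoint behind each badge: it returns JSON such as
# {"paperId": "...", "citationCount": 12345}, and shields.io renders the
# citationCount value. The same URL appears percent-encoded in every row above.
S2_API = "https://api.semanticscholar.org/graph/v1/paper/{paper_id}?fields=citationCount"

def citation_badge(paper_id: str, s2_page: str) -> str:
    """Return the Markdown for one citation badge linked to its Semantic Scholar page.

    Illustrative helper, not part of the repo: paper_id is the Semantic Scholar
    paper hash, s2_page is the human-readable paper page to link to.
    """
    api_url = S2_API.format(paper_id=paper_id)
    badge_url = (
        "https://img.shields.io/badge/dynamic/json"
        "?label=citation&query=citationCount&url=" + quote(api_url, safe="")
    )
    return f"[![citation]({badge_url})]({s2_page})"

# Example: regenerate the badge used in the CLIP row.
print(citation_badge(
    "6f870f7f02a8c59c3e23f407f3ef00dd1dcf8fc4",
    "https://www.semanticscholar.org/paper/Learning-Transferable-Visual-Models-From-Natural-Radford-Kim/6f870f7f02a8c59c3e23f407f3ef00dd1dcf8fc4",
))
```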



### Natural Language Processing - Transformer

| Recorded | Year | Title | Description | Citations |
