Generalize CLIPArchitecture #89
Conversation
Overall this looks pretty good! Left a few comments, but other than the stuff about the forward outputs, they're all relatively minor
warnings.warn(f"Missing encoder for extra input {key}") | ||
|
||
# Return a dataclass object instead of a dictionary | ||
clip_output = make_dataclass( |
Is there a specific reason we want to return a dataclass here? Imo one of the main advantages of dataclasses is that they follow a fixed schema, so returning one dynamically feels a bit unnatural.
I agree it feels unnatural (it took me a while to figure out how to make a dataclass dynamically). I used a dataclass to match the pattern set by other modules, but now I realize a lot of modules don't have it, so unless anyone is a strong proponent of output classes, I can return a dictionary instead.
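For reference, here's a minimal sketch of what building the output dynamically with make_dataclass can look like (field names and shapes are illustrative, not this PR's exact code):

```python
from dataclasses import make_dataclass

import torch

# Hypothetical per-modality embeddings; the dict keys become the output fields.
embeddings = {"image": torch.randn(2, 8), "text": torch.randn(2, 8)}

# Build the output class from whatever modalities were passed in, then fill it.
CLIPOutput = make_dataclass("CLIPOutput", [(k, torch.Tensor) for k in embeddings])
clip_output = CLIPOutput(**embeddings)

print(clip_output.image.shape)  # torch.Size([2, 8])
```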
The creation is dynamic, but once created the schema is fixed. An advantage of a dataclass is that we can use it for type hints. The counterpart to a dataclass is a NamedTuple, if we don't intend for inheritance. But no strong preference here.
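For comparison, a fixed-schema NamedTuple output (the alternative being discussed) can be declared statically and used directly in type hints; the names below are illustrative, not the library's actual output type:

```python
from typing import NamedTuple

import torch

# A statically declared output type: mypy understands it fully, at the cost of
# fixing the field names up front.
class CLIPOutput(NamedTuple):
    embeddings_a: torch.Tensor
    embeddings_b: torch.Tensor

def forward_stub() -> CLIPOutput:  # illustrative signature only
    return CLIPOutput(embeddings_a=torch.randn(2, 8), embeddings_b=torch.randn(2, 8))
```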
I would prefer NamedTuple for consistency with all our other model outputs, unless there's a clear advantage of using dataclass over NamedTuple.
Creating a NamedTuple dynamically causes issues with mypy, such that I have to include # type: ignore on the NamedTuple creation line; besides that, I don't see other relative advantages of dataclass.
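A sketch of the dynamic NamedTuple creation being described (names illustrative): it works at runtime, but mypy only accepts a literal field list in the functional NamedTuple form, hence the suppression comment.

```python
from typing import NamedTuple

import torch

keys = ("image", "text")  # modality names, illustrative only

# Runs fine, but mypy rejects a dynamically built field list.
CLIPOutput = NamedTuple("CLIPOutput", [(k, torch.Tensor) for k in keys])  # type: ignore

clip_output = CLIPOutput(image=torch.randn(2, 8), text=torch.randn(2, 8))
```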
for key in modalities.keys():
    if key not in self.encoders:
        warnings.warn(f"Missing encoder for extra input {key}")
I think your choice to raise a warning here makes sense. We might also want to do the same in late_fusion for the sake of consistency (doesn't have to be done in this PR though)
Thanks for the changes
I don't think we should change CLIP, which is a "standard" model, to make it play nice with MUGEN. Other options are to either have a different model if we want to eventually get to another "standard" model like VideoCLIP, or have a version in examples/mugen.
warnings.warn(f"Missing encoder for extra input {key}") | ||
|
||
# Return a dataclass object instead of a dictionary | ||
clip_output = make_dataclass( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The creation is dynamics but once created the schema is fixed.
An advantage of dataclass
is that we can use it for type hints.
The counterpart to dataclass is to use NamedTuple
if we don't intend for inheritance. But no strong preference here.
@ankitade This is not specific to MUGEN. The generalization is just in the sense that CLIP can compare more than two modalities, which is a common use case you might find in other research work.
@langong347 I do see @ankitade's point here. At the very least this is no longer really a CLIP-specific architecture, so I see two options: (1) keep CLIPArchitecture as is for CLIP proper, or (2) generalize/rename it into a contrastive architecture. To me, the argument for (1) is that CLIP is a very important model and should be a first-class citizen with its own architecture, while the argument for (2) is better generality (I think we have said we should not have an architecture unless it is used by multiple models anyways). Personally, I would lean slightly towards (2), but would like to hear others' thoughts as well.
Generalizing a SOTA model is not uncommon. This also relates to the discussion on "post paper model optimization". For example:
An architecture just represents a class of similar models. Initially it could be based off a particular instance, but it doesn't have to be restricted to where it came from. Compared to model builders, architectures are a lower level of abstraction. What we want to keep our fidelity to are the instances/builders, while the architecture is just the layer of abstraction beneath. No strong opinion about naming here: "CLIPArchitecture" is probably better as a reminder of its origin than "ContrastiveArchitecture", which is a term that hasn't been publicly coined yet.
@ebsmothers I'm leaning towards option 1. As for MUGEN, the linear projection layer after the encoder is slightly different from the CLIP paper (which only uses one linear layer, I believe?): https://github.com/mugen-org/MUGEN_baseline/blob/02c7058cd221f4b651d4ace2276b085cac1c5efd/lib/models/videoclip/modules.py#L15. So that leads me to believe MUGEN should have its own architecture. As for supporting more than two encoders, I'm not convinced of the benefit of that over multiple CLIPs, other than the convenience of getting all three embeddings at once for training or inference. That seems MUGEN specific, warranting the separate contrastive architecture for MUGEN anyway.
def test_forward(self, start):
    clip, input_query, input_retrieval = start
    assert isinstance(clip, torch.nn.Module)
Not sure if it's necessary to ensure that clip is a Module; I would remove this.
warnings.warn(f"Missing encoder for extra input {key}") | ||
|
||
# Return a dataclass object instead of a dictionary | ||
clip_output = make_dataclass( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer NamedTuple
for consistency with all our other model outputs, unless there's a clear advantage of using dataclass
over NamedTuple
Codecov Report

@@            Coverage Diff             @@
##             main      #89      +/-   ##
==========================================
+ Coverage   88.37%   88.39%   +0.01%
==========================================
  Files          35       35
  Lines        1850     1853       +3
==========================================
+ Hits         1635     1638       +3
  Misses        215      215

Continue to review full report at Codecov.
Both the original CLIPArchitecture and this generalized CLIPArchitecture avoided explicitly including the projection layer(s), because users may want different types of projections and the projection logic can be folded into the encoder that's passed in. We also can't guarantee that any projections passed in as arguments by the user have the same output size, so I don't see an advantage to including a projection argument. (Though this choice does assume that we want one general CLIP architecture and not two versions.)
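A sketch of what "folding the projection into the encoder" can look like (module names and sizes are illustrative):

```python
import torch
from torch import nn

class ProjectedEncoder(nn.Module):
    """Wrap any encoder with its own projection so the architecture never sees it."""

    def __init__(self, encoder: nn.Module, encoder_dim: int, proj_dim: int):
        super().__init__()
        self.encoder = encoder
        self.projection = nn.Linear(encoder_dim, proj_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.projection(self.encoder(x))

# Toy usage: a "text" encoder projected to the shared embedding size.
text_encoder = ProjectedEncoder(nn.Linear(16, 32), encoder_dim=32, proj_dim=8)
print(text_encoder(torch.randn(4, 16)).shape)  # torch.Size([4, 8])
```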
The projection can be absorbed into the encoders (see Sophia's post), so we can reuse the same CLIPArchitecture for an arbitrary pair of modalities; that's how CLIP has been extended in research. For that, hard-coding "text" and "image" in the keys of the output will not be suitable.
In MUGEN, the loss is computed pair-wise for the 3 modalities and summed together. We could instantiate 3 CLIP instances, each yielding just the loss for its pair, and combine them later in the Lightning module. My main concern about generalization is supporting different pairs of modalities using the same architecture.
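For illustration only (not MUGEN's actual loss code), the pairwise-and-sum scheme described above might look like this in the Lightning module:

```python
from itertools import combinations

import torch
import torch.nn.functional as F

def contrastive_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # Symmetric InfoNCE over in-batch negatives, as a stand-in for the real loss.
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t() / temperature
    targets = torch.arange(a.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Per-modality embeddings with illustrative shapes.
embeddings = {"text": torch.randn(4, 8), "video": torch.randn(4, 8), "audio": torch.randn(4, 8)}

# Compute the loss for every pair of modalities and sum them.
total_loss = sum(
    contrastive_loss(embeddings[a], embeddings[b]) for a, b in combinations(embeddings, 2)
)
```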
Agree with both @langong347 and @sophiazhi's points about keeping the projection layer out of the architecture. Even in CLIP the projection layer is not guaranteed to be present (I think they have one in the ViT version but not the ResNet one). A corollary to this would be that MUGEN does not need its own architecture just because it has a different projection layer.

I wouldn't recommend returning the similarity as part of the architecture though. Then we are starting to integrate our loss into the architecture, which we don't want in general. This had to be done for ALBEF because of how the similarities get used in the multimodal encoder, but this is also part of the reason that class was implemented as a model and not an architecture (since it then becomes much more specific to that particular model). Also, even simple old cosine similarity can be implemented in different ways, with both FLAVA and CLIP handling propagation of gradients differently. So I would keep this out and leave it up to the user how to use the embeddings.

For @ankitade's 3rd point, agree that returning more than two modalities doesn't really make sense for zero shot. Though hopefully if the user is running zero shot (or contrastive with batch negatives), they wouldn't pass more than two modalities anyways. However, these assumptions plus excessive generality could potentially cause confusion for the users on the "flagship" instantiation of CLIPArchitecture (CLIP 😉).

So ultimately I agree with @RdoubleA: leaving CLIPArchitecture as is feels like the right path. But I do think we should generalize the MUGEN architecture to return a dict (as opposed to an arbitrary pair). Otherwise in a case like this, we would have to call each of the encoders multiple times.

(As an aside, this whole convo is yet another interesting test of our "don't generalize until you need to" principle...)
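As an illustration of "leaving it up to the user", similarities can be computed from the returned embeddings entirely outside the architecture (placeholder tensors below):

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for the architecture's outputs.
image_embeddings = F.normalize(torch.randn(4, 8), dim=-1)
text_embeddings = F.normalize(torch.randn(4, 8), dim=-1)

similarity = image_embeddings @ text_embeddings.t()  # (4, 4) cosine similarities
best_text_per_image = similarity.argmax(dim=-1)      # e.g. zero-shot retrieval
```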
@sophiazhi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary:
Generalizes CLIPArchitecture to allow two encoders of any modalities and adds a test suite for CLIPArchitecture. Ultimately, the goal is to support multimodal models beyond image/text, like MUGEN, which uses audio/text/video.
Test plan:
Run the command
pytest --cov=torchmultimodal/architectures/ test/architectures/test_clip.py::TestCLIPArchitecture -vv
to run the unit tests included in this PR.