Generalize CLIPArchitecture #89
Conversation
Overall this looks pretty good! Left a few comments, but other than the stuff about the forward outputs, they're all relatively minor
warnings.warn(f"Missing encoder for extra input {key}") | ||
|
||
# Return a dataclass object instead of a dictionary | ||
clip_output = make_dataclass( |
Is there a specific reason we want to return a dataclass here? Imo one of the main advantages of dataclasses is that they follow a fixed schema, so returning one dynamically feels a bit unnatural.
I agree it feels unnatural (it took me a while to figure out how to make a dataclass dynamically). I used a dataclass to match the pattern set by other modules, but now I realize a lot of modules don't have it, so unless anyone is a strong proponent of output classes, I can return a dictionary instead.
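For reference, here's a minimal sketch of what building the output dynamically with make_dataclass can look like (field names and shapes are illustrative, not this PR's exact code):

```python
from dataclasses import make_dataclass

import torch

# Hypothetical per-modality embeddings; the dict keys become the output fields.
embeddings = {"image": torch.randn(2, 8), "text": torch.randn(2, 8)}

# Build the output class from whatever modalities were passed in, then fill it.
CLIPOutput = make_dataclass("CLIPOutput", [(k, torch.Tensor) for k in embeddings])
clip_output = CLIPOutput(**embeddings)

print(clip_output.image.shape)  # torch.Size([2, 8])
```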
The creation is dynamic, but once created the schema is fixed. An advantage of a dataclass is that we can use it for type hints. The counterpart to a dataclass is a NamedTuple, if we don't intend for inheritance. But no strong preference here.
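For comparison, a fixed-schema NamedTuple output (the alternative being discussed) can be declared statically and used directly in type hints; the names below are illustrative, not the library's actual output type:

```python
from typing import NamedTuple

import torch

# A statically declared output type: mypy understands it fully, at the cost of
# fixing the field names up front.
class CLIPOutput(NamedTuple):
    embeddings_a: torch.Tensor
    embeddings_b: torch.Tensor

def forward_stub() -> CLIPOutput:  # illustrative signature only
    return CLIPOutput(embeddings_a=torch.randn(2, 8), embeddings_b=torch.randn(2, 8))
```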
I would prefer NamedTuple for consistency with all our other model outputs, unless there's a clear advantage of using dataclass over NamedTuple.
Creating a NamedTuple dynamically causes issues with mypy, such that I have to include # type: ignore on the NamedTuple creation line; besides that, I don't see other relative advantages of dataclass.
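A sketch of the dynamic NamedTuple creation being described (names illustrative): it works at runtime, but mypy only accepts a literal field list in the functional NamedTuple form, hence the suppression comment.

```python
from typing import NamedTuple

import torch

keys = ("image", "text")  # modality names, illustrative only

# Runs fine, but mypy rejects a dynamically built field list.
CLIPOutput = NamedTuple("CLIPOutput", [(k, torch.Tensor) for k in keys])  # type: ignore

clip_output = CLIPOutput(image=torch.randn(2, 8), text=torch.randn(2, 8))
```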
for key in modalities.keys():
    if key not in self.encoders:
        warnings.warn(f"Missing encoder for extra input {key}")
I think your choice to raise a warning here makes sense. We might also want to do the same in late_fusion for the sake of consistency (doesn't have to be done in this PR though)
Thanks for the changes
I don't think we should change CLIP, which is a "standard" model, to make it play nice with MUGEN. Other options are to either have a different model if we want to eventually get to another "standard" model like VideoCLIP, or have a version in examples/mugen.
warnings.warn(f"Missing encoder for extra input {key}") | ||
|
||
# Return a dataclass object instead of a dictionary | ||
clip_output = make_dataclass( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The creation is dynamics but once created the schema is fixed.
An advantage of dataclass
is that we can use it for type hints.
The counterpart to dataclass is to use NamedTuple
if we don't intend for inheritance. But no strong preference here.
@ankitade This is not specific to MUGEN. The generalization is just in the sense that CLIP can compare more than two modalities, which is a common use case you might find in other research work.
@langong347 I do see @ankitade's point here. At the very least this is no longer really a CLIP-specific architecture, so I see two options: (1) keep CLIPArchitecture as is for CLIP proper, or (2) generalize/rename it into a contrastive architecture. To me, the argument for (1) is that CLIP is a very important model and should be a first-class citizen with its own architecture, while the argument for (2) is better generality (I think we have said we should not have an architecture unless it is used by multiple models anyways). Personally, I would lean slightly towards (2), but would like to hear others' thoughts as well.
Generalizing a SOTA model is not uncommon. This also relates to the discussion on "post paper model optimization". For example:
An architecture just represents a class of similar models. Initially it could be based off a particular instance, but it doesn't have to be restricted to where it came from. Compared to model builders, architectures are a lower level of abstraction. What we want to keep our fidelity to are the instances/builders, while the architecture is just the layer of abstraction beneath. No strong opinion about naming here: "CLIPArchitecture" is probably better as a reminder of its origin than "ContrastiveArchitecture", which is a term that hasn't been publicly coined yet.
@ebsmothers I'm leaning towards option 1. As for MUGEN, the linear projection layer after the encoder is slightly different from the CLIP paper (which only uses one linear layer, I believe?): https://github.com/mugen-org/MUGEN_baseline/blob/02c7058cd221f4b651d4ace2276b085cac1c5efd/lib/models/videoclip/modules.py#L15. So that leads me to believe MUGEN should have its own architecture. As for supporting more than two encoders, I'm not convinced of the benefit of that over multiple CLIPs, other than the convenience of getting all three embeddings at once for training or inference. That seems MUGEN specific, warranting the separate contrastive architecture for MUGEN anyway.
def test_forward(self, start):
    clip, input_query, input_retrieval = start
    assert isinstance(clip, torch.nn.Module)
Not sure if it's necessary to ensure that clip is a Module; I would remove this.
warnings.warn(f"Missing encoder for extra input {key}") | ||
|
||
# Return a dataclass object instead of a dictionary | ||
clip_output = make_dataclass( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer NamedTuple
for consistency with all our other model outputs, unless there's a clear advantage of using dataclass
over NamedTuple
Codecov Report

@@            Coverage Diff             @@
##             main      #89      +/-   ##
==========================================
+ Coverage   88.37%   88.39%   +0.01%
==========================================
  Files          35       35
  Lines        1850     1853       +3
==========================================
+ Hits         1635     1638       +3
  Misses        215      215

Continue to review full report at Codecov.
Both the original CLIPArchitecture and this generalized CLIPArchitecture avoided explicitly including the projection layer(s), because users may want different types of projections and the projection logic can be folded into the encoder that's passed in. We also can't guarantee that any projections passed in as arguments by the user have the same output size, so I don't see an advantage to including a projection argument. (Though this choice does assume that we want one general CLIP architecture and not two versions.)
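A sketch of what "folding the projection into the encoder" can look like (module names and sizes are illustrative):

```python
import torch
from torch import nn

class ProjectedEncoder(nn.Module):
    """Wrap any encoder with its own projection so the architecture never sees it."""

    def __init__(self, encoder: nn.Module, encoder_dim: int, proj_dim: int):
        super().__init__()
        self.encoder = encoder
        self.projection = nn.Linear(encoder_dim, proj_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.projection(self.encoder(x))

# Toy usage: a "text" encoder projected to the shared embedding size.
text_encoder = ProjectedEncoder(nn.Linear(16, 32), encoder_dim=32, proj_dim=8)
print(text_encoder(torch.randn(4, 16)).shape)  # torch.Size([4, 8])
```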
The projection can be absorbed into the encoders (see Sophia's post), so we can reuse the same CLIPArchitecture for an arbitrary pair of modalities; that's how CLIP has been extended in research. For that, hard-coding "text" and "image" in the keys of the output will not be suitable.
In MUGEN, the loss is computed pair-wise for the 3 modalities and summed together. We could instantiate 3 CLIP instances, each yielding just the loss for its pair, and combine them later in the Lightning module. My main concern about generalization is supporting different pairs of modalities using the same architecture.
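For illustration only (not MUGEN's actual loss code), the pairwise-and-sum scheme described above might look like this in the Lightning module:

```python
from itertools import combinations

import torch
import torch.nn.functional as F

def contrastive_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # Symmetric InfoNCE over in-batch negatives, as a stand-in for the real loss.
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t() / temperature
    targets = torch.arange(a.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Per-modality embeddings with illustrative shapes.
embeddings = {"text": torch.randn(4, 8), "video": torch.randn(4, 8), "audio": torch.randn(4, 8)}

# Compute the loss for every pair of modalities and sum them.
total_loss = sum(
    contrastive_loss(embeddings[a], embeddings[b]) for a, b in combinations(embeddings, 2)
)
```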
Agree with both @langong347 and @sophiazhi's points about keeping the projection layer out of the architecture. Even in CLIP the projection layer is not guaranteed to be present (I think they have one in the ViT version but not the ResNet one). A corollary to this would be that MUGEN does not need its own architecture just because it has a different projection layer.

I wouldn't recommend returning the similarity as part of the architecture though. Then we are starting to integrate our loss into the architecture, which we don't want in general. This had to be done for ALBEF because of how the similarities get used in the multimodal encoder, but this is also part of the reason that class was implemented as a model and not an architecture (since it then becomes much more specific to that particular model). Also, even simple old cosine similarity can be implemented in different ways, with both FLAVA and CLIP handling propagation of gradients differently. So I would keep this out and leave it up to the user how to use the embeddings.

For @ankitade's 3rd point, agree that returning more than two modalities doesn't really make sense for zero shot. Though hopefully if the user is running zero shot (or contrastive with batch negatives), they wouldn't pass more than two modalities anyways. However, these assumptions plus excessive generality could potentially cause confusion for the users on the "flagship" instantiation of CLIPArchitecture (CLIP 😉).

So ultimately I agree with @RdoubleA: leaving CLIPArchitecture as is feels like the right path. But I do think we should generalize the MUGEN architecture to return a dict (as opposed to an arbitrary pair). Otherwise in a case like this, we would have to call each of the encoders multiple times.

(As an aside, this whole convo is yet another interesting test of our "don't generalize until you need to" principle...)
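As an illustration of "leaving it up to the user", similarities can be computed from the returned embeddings entirely outside the architecture (placeholder tensors below):

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for the architecture's outputs.
image_embeddings = F.normalize(torch.randn(4, 8), dim=-1)
text_embeddings = F.normalize(torch.randn(4, 8), dim=-1)

similarity = image_embeddings @ text_embeddings.t()  # (4, 4) cosine similarities
best_text_per_image = similarity.argmax(dim=-1)      # e.g. zero-shot retrieval
```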
@sophiazhi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary:
Generalizes CLIPArchitecture to allow two encoders of any modalities and adds a test suite for CLIPArchitecture. Ultimately, the goal is to support multimodal models beyond image/text, like MUGEN, which uses audio/text/video.
Test plan:
Run the command
pytest --cov=torchmultimodal/architectures/ test/architectures/test_clip.py::TestCLIPArchitecture -vv
to run the unit tests included in this PR.