How can I give a boolean mask as input #13

Open
Mythili-kannan opened this issue Feb 25, 2025 · 13 comments

@Mythili-kannan

Can I give a boolean mask of an object as input and expect SAM2 to detect that object across videos? If so, can you give some boilerplate code to do that?

@heyoeyo
Owner

heyoeyo commented Feb 25, 2025

Yes, it's possible to begin tracking from a mask. There wasn't an easy way to do this using the existing code, so I just added an extra function for it, along with a simple example to show how it works. You can find the example code under the 'simple examples' folder; it's the script:
video_segmentation_from_mask.py

You'll need to update to the newest commit to get that script as well as the newly added function for handling the initialization.

@Mythili-kannan
Author

Thank you so much, it works! How can I use it for multiple objects? I tried some code, but since the initialize-from-mask function doesn't return best_obj_ptr like the initialize video masking function does, I'm running into an error. Can you help create a sample script for video segmentation from mask with multiple objects?

@heyoeyo
Owner

heyoeyo commented Feb 26, 2025

Tracking multiple objects is just a matter of keeping separate memory storage for each object; these are the prompt/prev variables in the mask example. For example, to track two objects, you could have:

from collections import deque

# Initialize first tracked object
init_encoded_img_obj1, _, _ = sammodel.encode_image(init_image_obj1, **imgenc_config_dict)
init_mem_1 = sammodel.initialize_from_mask(init_encoded_img_obj1, init_mask_obj1)
prompt_mems_1 = deque([init_mem_1])
prompt_ptrs_1 = deque([])
prev_mems_1 = deque([], maxlen=6)
prev_ptrs_1 = deque([], maxlen=15)

# Initialize second tracked object
init_encoded_img_obj2, _, _ = sammodel.encode_image(init_image_obj2, **imgenc_config_dict)
init_mem_2 = sammodel.initialize_from_mask(init_encoded_img_obj2, init_mask_obj2)
prompt_mems_2 = deque([init_mem_2])
prompt_ptrs_2 = deque([])
prev_mems_2 = deque([], maxlen=6)
prev_ptrs_2 = deque([], maxlen=15)

Then during the video, you just need to run the 'step video' function using the memory data for each object:

# Update tracking for object 1
obj_score_1, mask_idx_1, mask_preds_1, mem_enc_1, obj_ptr_1 = sammodel.step_video_masking(
  encoded_imgs_list, prompt_mems_1, prompt_ptrs_1, prev_mems_1, prev_ptrs_1
)
if obj_score_1 > 0:
  prev_mems_1.appendleft(mem_enc_1)
  prev_ptrs_1.appendleft(obj_ptr_1)

# Update tracking for object 2
obj_score_2, mask_idx_2, mask_preds_2, mem_enc_2, obj_ptr_2 = sammodel.step_video_masking(
  encoded_imgs_list, prompt_mems_2, prompt_ptrs_2, prev_mems_2, prev_ptrs_2
)
if obj_score_2 > 0:
  prev_mems_2.appendleft(mem_enc_2)
  prev_ptrs_2.appendleft(obj_ptr_2)

Since this is a lot of copy/pasting the same things over and over, there is a helper object for managing the memory data called SAM2VideoObjectResults. I've just updated it so that it can be initialized without needing the object pointer, which means it can be used with the mask initialization. Re-doing the example above with this helper would look something like:

from lib.demo_helpers.video_data_storage import SAM2VideoObjectResults

# Initialize first tracked object
init_encoded_img_obj1, _, _ = sammodel.encode_image(init_image_obj1, **imgenc_config_dict)
init_mem_1 = sammodel.initialize_from_mask(init_encoded_img_obj1, init_mask_obj1)
obj_1_mem = SAM2VideoObjectResults.create()
obj_1_mem.store_prompt_result(0, init_mem_1)

# Initialize second tracked object
init_encoded_img_obj2, _, _ = sammodel.encode_image(init_image_obj2, **imgenc_config_dict)
init_mem_2 = sammodel.initialize_from_mask(init_encoded_img_obj2, init_mask_obj2)
obj_2_mem = SAM2VideoObjectResults.create()
obj_2_mem.store_prompt_result(0, init_mem_2)

And then the tracking would instead look like:

# Update tracking for object 1
obj_score_1, mask_idx_1, mask_preds_1, mem_enc_1, obj_ptr_1 = sammodel.step_video_masking(
  encoded_imgs_list, **obj_1_mem.to_dict()
)
if obj_score_1 > 0:
  obj_1_mem.store_result(frame_idx, mem_enc_1, obj_ptr_1)

# Update tracking for object 2
obj_score_2, mask_idx_2, mask_preds_2, mem_enc_2, obj_ptr_2 = sammodel.step_video_masking(
  encoded_imgs_list, **obj_2_mem.to_dict()
)
if obj_score_2 > 0:
  obj_2_mem.store_result(frame_idx, mem_enc_2, obj_ptr_2)

If you have many objects, then it makes sense to use loops to initialize/update all of the objects. The multi-object tracking script has an example of doing this with point/box prompts, but the idea would be the same with masks (you just need to load all of the associated images/masks instead of the box/point coordinates).
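
As a rough sketch of what that looping could look like (not taken from the repo itself; it just re-uses the functions shown above, and images_and_masks / frame_idx are placeholder names for whatever data and loop variables you already have):

# Rough sketch: initialize one memory helper per object
# ('images_and_masks' is a placeholder list of (init_image, init_mask) pairs)
object_memories = []
for init_image, init_mask in images_and_masks:
    encoded_img, _, _ = sammodel.encode_image(init_image, **imgenc_config_dict)
    init_mem = sammodel.initialize_from_mask(encoded_img, init_mask)
    obj_mem = SAM2VideoObjectResults.create()
    obj_mem.store_prompt_result(0, init_mem)
    object_memories.append(obj_mem)

# Per-frame update, re-using the same encoded frame data for every object
for obj_mem in object_memories:
    obj_score, mask_idx, mask_preds, mem_enc, obj_ptr = sammodel.step_video_masking(
        encoded_imgs_list, **obj_mem.to_dict()
    )
    if obj_score > 0:
        obj_mem.store_result(frame_idx, mem_enc, obj_ptr)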

@Mythili-kannan
Author

Mythili-kannan commented Feb 26, 2025

Sure, thank you.

But if I pass an empty deque as obj_ptr, it throws an error. I referred to the original SAMv2 repo and created a torch zeros tensor as obj_ptr inside the add_mask function. Is that the right way to do it? Does it affect the result?

        # Hard-code the object score as being 'high/confident', since we assume the given mask is accurate
        obj_score = torch.tensor(100.0, device=device, dtype=dtype)
        obj_ptrs = torch.zeros((1, 1, 256), device=device, dtype=dtype)

@heyoeyo
Owner

heyoeyo commented Feb 26, 2025

if I pass an empty deque as obj_ptr, it throws an error

An empty list/deque shouldn't cause any problems (that's how the video_segmentation_from_mask.py already works), maybe there's some other error occurring?

created a torch zeros tensor as obj_ptr inside the add_mask function. Is that the right way to do it? Does it affect the result?

Yes that should work, though it may slightly degrade the tracking since it's now looking for something that would have produced an object pointer of all zeros, which probably isn't true for any real objects. However, the tracking isn't very sensitive to the object pointers anyways, so it's probably fine.

@Mythili-kannan
Author

Thank you so much, it works. One more small doubt:

Is there any way we can utilize the features and the predicted mask to classify the predicted mask? Or does any classification happen internally, like mapping masks to certain labels (0, 1, and so on)?

@heyoeyo
Owner

heyoeyo commented Feb 27, 2025

does any classification happen internally

The segment-anything models are a bit unusual in that they don't use a classifier token internally. That seems to be where the model name comes from actually (they aren't limited to segmenting based on trained labels, so they can segment 'anything'). So the SAM models on their own are likely very poorly suited to classification.

It might be possible to make a rough classifier based on something like the semantic similarity experiment script. The idea being to pre-compute image tokens for a bunch of example classes you want to label, then for some new masked image, compute the similarity with each of the pre-computed tokens, and whichever class scores highest is the classification prediction.
I think this could perform significantly better if the similarity calculation is replaced with a custom classifier to process the image tokens, however, that would require training a custom model which would be a lot more difficult.
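
For example, a very rough sketch of that idea (my own illustration, not something in the repo; the mean-pooling of image tokens into a single vector and the handling of the encode_image outputs are assumptions, and class_examples / masked_image are placeholders):

import torch

# Rough sketch of a nearest-class lookup based on pooled image tokens
# (the pooling and the handling of encode_image outputs are assumptions)
def pool_image_tokens(encoded_img):
    tokens = encoded_img[-1] if isinstance(encoded_img, (list, tuple)) else encoded_img
    return tokens.flatten(2).mean(dim=-1).squeeze()

# Pre-compute one reference vector per class ('class_examples' is a placeholder
# dict mapping class names to example images of that class)
class_vectors = {}
for class_name, example_image in class_examples.items():
    encoded_img, _, _ = sammodel.encode_image(example_image, **imgenc_config_dict)
    class_vectors[class_name] = pool_image_tokens(encoded_img)

# Classify a new masked image by cosine similarity with each reference vector
encoded_new, _, _ = sammodel.encode_image(masked_image, **imgenc_config_dict)
new_vector = pool_image_tokens(encoded_new)
scores = {name: torch.cosine_similarity(new_vector, vec, dim=0).item()
          for name, vec in class_vectors.items()}
predicted_class = max(scores, key=scores.get)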

Alternatively, the segmentation mask could be used to crop out the part of the image containing the object, and then an image classifier, like YOLO, could be used on the cropped image. This approach would likely be easier to implement, especially if you can find an image classifier that's already trained on the classes you're trying to label.
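
A minimal sketch of that cropping step (assuming 'mask' is a 2D binary array the same size as 'frame', and 'classify' is a placeholder for whatever classifier you end up using):

import numpy as np

# Crop the frame to the mask's bounding box before classifying
ys, xs = np.nonzero(mask)
if len(ys) > 0:
    y1, y2 = ys.min(), ys.max() + 1
    x1, x2 = xs.min(), xs.max() + 1
    cropped_image = frame[y1:y2, x1:x2]
    predicted_label = classify(cropped_image)  # placeholder for your own classifier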

@Mythili-kannan
Author

Thank you for the suggestion. I'm getting false positive segments; I'm not sure whether it's because the features are similar or something else is the issue. Is there a way we can get a confidence score, or any other way to mitigate the false positive segments?

@heyoeyo
Owner

heyoeyo commented Feb 28, 2025

Is there a way we can get a confidence score

The IoU output is intended to act as a confidence score. However, for SAMv2, I've found that the stability score sometimes works better as a predictor of mask quality.
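
For reference, a stability score is normally computed from the raw (pre-threshold) mask predictions by comparing the mask area at two slightly different cutoffs, similar to the original SAM automatic mask generator. A rough sketch (the threshold values here are just illustrative defaults):

import torch

def stability_score(mask_logits, mask_threshold=0.0, threshold_offset=1.0):
    # Compare the mask area at a slightly higher vs. slightly lower cutoff;
    # stable masks barely change, so the ratio stays close to 1
    area_high = (mask_logits > (mask_threshold + threshold_offset)).sum(dtype=torch.float32)
    area_low = (mask_logits > (mask_threshold - threshold_offset)).sum(dtype=torch.float32)
    return (area_high / area_low).item() if area_low > 0 else 0.0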

any other way to mitigate the false positive segments

As for mitigating things, one option is to adjust the prompts. If you're using masks, then that would mean editing the masks themselves, otherwise you could also try using points or bounding boxes as prompts (and maybe negative points to exclude unwanted segments). If you're working with video and false positives show up over time, then it may help to provide additional prompts later on to 'reset' to a more accurate segmentation.

There are also other versions of SAM (like sam-hq; the v1 version of sam-hq is supported in this repo, under a different branch), which seem to do a lot better with fine details, in case that's related to the false positives you're getting.

Otherwise, having a model that is fine-tuned to the data you're working with is probably the best way to consistently improve results. The SAMv2 repo now has a training script to help with doing this, though it is probably a lot of work.

@Mythili-kannan
Author

Mythili-kannan commented Mar 1, 2025

How can I use sam-hq with a similar setup, where the prompt is a mask and inference runs on some images?

Also, is there any way to visualize attention from SAM2?

Actually, the problem I'm facing is: say my prompt image contains a tiny product such as a screw, along with its corresponding mask. In the video there are places where the screw isn't even there, but it still segments that region as a screw. Can you please help with that?

@heyoeyo
Owner

heyoeyo commented Mar 1, 2025

How can I use sam-hq with a similar setup, where the prompt is a mask and inference runs on some images?

Since only the v1 model is supported for now, it can't be used the same way as video masking with v2 models. You can use the built-in mask prompting of the SAMv1 (or v2) models on images at least, for example if you use the run_image.py script and provide the --mask_path flag to point at a mask image. However, this generally doesn't work as well as what the v2/video masking is capable of.
It's also still possible to use the sam-hq v2 model in all of the scripts, and it seems to work ok (even though it's not properly supported); it may give better results than the original model in some cases.

is there any way to visualize attention from SAM2

The v2 models implement attention using an optimized scaled_dot_product_attention command, which makes it impossible to access the internal attention softmax results for visualization as-is. It is possible to implement the attention calculation more directly to get access to the softmax outputs, which I might do at some point in the future. It'll be a while until I can get to that, so if you wanted to make the change yourself, it would require first adding a softmax to the init of the attention class:

self.softmax = nn.Softmax(dim=-1)

And then replacing the existing scaled_dot_product_attention step with something like:

# Scale queries, compute attention scores, then softmax to get per-token weights
q = q * (self.features_per_head ** -0.5)
attn = q @ k.transpose(-2, -1)
value_weighting = self.softmax(attn)
# Weight the values and restore the original (batch, tokens, channels) layout
output = (value_weighting @ v).transpose(1, 2).reshape(B, N, C)

Technically these changes would need to be made to the pooled attention class as well, though those results may be quite hard to visualize.
With those changes, it would then be possible to capture the softmax results for visualization/analysis. It can be done the same way the block norm experiment script works:

captures = ModelOutputCapture(sammodel, target_modules=torch.nn.Softmax)
encoded_img, _, _ = sammodel.encode_image(full_image_bgr, **imgenc_config_dict)
# -> Attention results will be inside the 'captures' list
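
From there, one of the captured results could be displayed directly, for example (a rough sketch, assuming the captured outputs are tensors with the attention weights in their last two dimensions):

import matplotlib.pyplot as plt

# Average one captured softmax result over its leading (batch/head) dimensions
# and display the remaining token-to-token attention matrix
attn_weights = captures[0].float().cpu()
while attn_weights.ndim > 2:
    attn_weights = attn_weights.mean(dim=0)
plt.imshow(attn_weights, cmap="viridis")
plt.title("Attention weights (first captured softmax)")
plt.show()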

there are places where the screw isn't even there, but it still segments that region as a screw

One thing to check is to make sure that the mask prompt is only being given once at the start of tracking. If the prompt gets repeated it could cause the model to be heavily biased towards segmenting that same region. The other thing to look out for is that the prompt should be close to wherever the object is when the tracking starts, since SAMv2 is more trained for tracking than detection. If the object starts far away from the prompt, it may not track properly.
If you do have a case that requires object detection, you could try something like DINOv which can apparently do image-based prompting/detection, or something like GroundedSAM2 which includes a detector model.

@Mythili-kannan
Author

Mythili-kannan commented Mar 2, 2025

The other thing to look out for is that the prompt should be close to wherever the object is when the tracking starts, since SAMv2 is more trained for tracking than detection. If the object starts far away from the prompt, it may not track properly.

I didn't understand this case. For example, I have an image and a prompt mask of a screw. When I create memory from it and use that on video frames, in the initial frames there are no screws present in that region, yet it still segments those places.

@heyoeyo
Owner

heyoeyo commented Mar 2, 2025

in the initial frames there are no screws present in that region, yet it still segments those places

The SAMv2 model is trained for tracking rather than detection. So there's a built-in assumption that the first frame after the initial prompt is occurring immediately after the prompt frame, in time, so that the object is likely to be very close (in position) to the initial prompt position.
For example, you can see from their training data that there aren't any videos with sudden jumps/cuts where objects suddenly change position from one frame to the next.

Here's an example (using the cross-image segmentation experiment script) showing the behavior of the model. The top image is being used as a prompt for the bottom image (as if the bottom image is the first frame of a video):

[Image: cross-image segmentation example, with a bird photo (top) used as the prompt for a sheep photo (bottom)]

You can see how the model tends to segment things based on a combination of closeness in appearance as well as position in the image. So for example, a prompt over the eye of the right-most bird segments the ears of the right sheep. The left-most bird segments the left sheep and the black butterfly pattern at the top of the bird image segments the darker shapes at the top of the sheep image. So the model isn't looking to 'detect' the prompted object in other frames (i.e. it's not looking for birds in the sheep image). It's more like it assumes the same object is in the next frame and is close by, but maybe changed appearance. So a bird in one part of an image can be used to segment a sheep in the same region, even though they don't look that similar. The mismatch in appearance does tend to show up as a low object score though (the object score is 10 when matching the bird images to themselves, but around 1 when matching birds to sheep).

For the example you gave, it's probably the case that there's some distinct shape in the same region as the 'screw prompt', so it's as if the model thinks the screw 'changed appearance' from one frame (the prompt) to the next (initial frame of the video) and segments the object in that region.

If you're trying to detect an object based on an image (and not be affected by position), the DINOv model might be a better match (or the related t-rex model), since it doesn't have the same tracking assumptions.
