How can I give a boolean mask as input? #13
Comments
Yes, it's possible to begin tracking from a mask. There wasn't an easy way to do this using the existing code, so I just added an extra function for doing this, as well as a simple example to show how it works. You can find the example code under the 'simple examples' folder; it's the video_segmentation_from_mask.py script. You'll need to update to the newest commit to get that script as well as the newly added function for handling the initialization.
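For reference, a minimal single-object sketch of that flow might look like the following. This assumes the same sammodel / imgenc_config_dict setup as in the example scripts; the file paths, the video-reading loop, and the assumption that initialize_from_mask accepts a boolean numpy mask are placeholders to adapt to your own data:

from collections import deque
import cv2

# Load the prompt frame and a boolean mask of the target object (placeholder paths)
init_image = cv2.imread("prompt_frame.jpg")
init_mask = cv2.imread("prompt_mask.png", cv2.IMREAD_GRAYSCALE) > 127

# Encode the prompt frame and build the initial memory from the mask
init_encoded_img, _, _ = sammodel.encode_image(init_image, **imgenc_config_dict)
init_mem = sammodel.initialize_from_mask(init_encoded_img, init_mask)

# Memory storage for the tracked object (same structure as the mask example script)
prompt_mems = deque([init_mem])
prompt_ptrs = deque([])
prev_mems = deque([], maxlen=6)
prev_ptrs = deque([], maxlen=15)

# Step through the video, updating the memory whenever the object is found
vcap = cv2.VideoCapture("input_video.mp4")
while True:
    ok_frame, frame = vcap.read()
    if not ok_frame:
        break
    encoded_imgs_list, _, _ = sammodel.encode_image(frame, **imgenc_config_dict)
    obj_score, mask_idx, mask_preds, mem_enc, obj_ptr = sammodel.step_video_masking(
        encoded_imgs_list, prompt_mems, prompt_ptrs, prev_mems, prev_ptrs
    )
    if obj_score > 0:
        prev_mems.appendleft(mem_enc)
        prev_ptrs.appendleft(obj_ptr)
vcap.release()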
Thank you so much, it works! How can I use it for multiple objects? I tried some code, but since the initialize_from_mask function doesn't return best_obj_ptr like initialize_video_masking does, I'm facing an error. Can you help create a sample script for video segmentation from a mask for multiple objects?
Tracking multiple objects is just a matter of keeping separate memory storage for each object. These are the prompt/prev variables in the mask example. So, for example, for tracking two objects you could have:

from collections import deque

# Initialize first tracked object
init_encoded_img_obj1, _, _ = sammodel.encode_image(init_image_obj1, **imgenc_config_dict)
init_mem_1 = sammodel.initialize_from_mask(init_encoded_img_obj1, init_mask_obj1)
prompt_mems_1 = deque([init_mem_1])
prompt_ptrs_1 = deque([])
prev_mems_1 = deque([], maxlen=6)
prev_ptrs_1 = deque([], maxlen=15)
# Initialize second tracked object
init_encoded_img_obj2, _, _ = sammodel.encode_image(init_image_obj2, **imgenc_config_dict)
init_mem_2 = sammodel.initialize_from_mask(init_encoded_img_obj2, init_mask_obj2)
prompt_mems_2 = deque([init_mem_2])
prompt_ptrs_2 = deque([])
prev_mems_2 = deque([], maxlen=6)
prev_ptrs_2 = deque([], maxlen=15)

Then during the video, you just need to run the 'step video' function using the memory data for each object:

# Update tracking for object 1
obj_score_1, mask_idx_1, mask_preds_1, mem_enc_1, obj_ptr_1 = sammodel.step_video_masking(
encoded_imgs_list, prompt_mems_1, prompt_ptrs_1, prev_mems_1, prev_ptrs_1
)
if obj_score_1 > 0:
prev_mems_1.appendleft(mem_enc_1)
prev_ptrs_1.appendleft(obj_ptr_1)
# Update tracking for object 2
obj_score_2, mask_idx_2, mask_preds_2, mem_enc_2, obj_ptr_2 = sammodel.step_video_masking(
encoded_imgs_list, prompt_mems_2, prompt_ptrs_2, prev_mems_2, prev_ptrs_2
)
if obj_score_2 > 0:
prev_mems_2.appendleft(mem_enc_2)
prev_ptrs_2.appendleft(obj_ptr_2)

Since this is a lot of copy/pasting the same things over and over, there is a helper object for managing the memory data called SAM2VideoObjectResults, which can be imported with:

from lib.demo_helpers.video_data_storage import SAM2VideoObjectResults

Using this helper, the initialization would instead look like:
# Initialize first tracked object
init_encoded_img_obj1, _, _ = sammodel.encode_image(init_image_obj1, **imgenc_config_dict)
init_mem_1 = sammodel.initialize_from_mask(init_encoded_img_obj1, init_mask_obj1)
obj_1_mem = SAM2VideoObjectResults.create()
obj_1_mem.store_prompt_result(0, init_mem_1)
# Initialize second tracked object
init_encoded_img_obj2, _, _ = sammodel.encode_image(init_image_obj2, **imgenc_config_dict)
init_mem_2 = sammodel.initialize_from_mask(init_encoded_img_obj2, init_mask_obj2)
obj_2_mem = SAM2VideoObjectResults.create()
obj_2_mem.store_prompt_result(0, init_mem_2)

And then the tracking would instead look like:

# Update tracking for object 1
obj_score_1, mask_idx_1, mask_preds_1, mem_enc_1, obj_ptr_1 = sammodel.step_video_masking(
encoded_imgs_list, **obj_1_mem.to_dict()
)
if obj_score_1 > 0:
obj_1_mem.store_result(frame_idx, mem_enc_1, obj_ptr_1)
# Update tracking for object 2
obj_score_2, mask_idx_2, mask_preds_2, mem_enc_2, obj_ptr_2 = sammodel.step_video_masking(
encoded_imgs_list, **obj_2_mem.to_dict()
)
if obj_score_2 > 0:
obj_2_mem.store_result(frame_idx, mem_enc_2, obj_ptr_2)

If you have many objects, then it makes sense to use loops to initialize/update all of the objects. The multi-object tracking script has an example of doing this with point/box prompts, but the idea would be the same with masks (you just need to load all of the associated images/masks instead of the box/point coordinates).
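As a rough sketch of that loop-based approach with mask prompts (the image/mask loading, the file paths, and the surrounding video-loop variables like frame and frame_idx are assumptions adapted from the snippets above, not taken from the multi-object script itself):

import cv2
from lib.demo_helpers.video_data_storage import SAM2VideoObjectResults

# Placeholder listing of (prompt image, prompt mask) pairs, one per tracked object
prompt_pairs = [
    ("obj1_frame.jpg", "obj1_mask.png"),
    ("obj2_frame.jpg", "obj2_mask.png"),
]

# Initialize a memory helper for every object from its mask prompt
object_memories = []
for image_path, mask_path in prompt_pairs:
    prompt_image = cv2.imread(image_path)
    prompt_mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE) > 127
    encoded_prompt_img, _, _ = sammodel.encode_image(prompt_image, **imgenc_config_dict)
    init_mem = sammodel.initialize_from_mask(encoded_prompt_img, prompt_mask)
    obj_mem = SAM2VideoObjectResults.create()
    obj_mem.store_prompt_result(0, init_mem)
    object_memories.append(obj_mem)

# Inside the video loop: encode each frame once, then step every object's tracking
encoded_imgs_list, _, _ = sammodel.encode_image(frame, **imgenc_config_dict)
for obj_mem in object_memories:
    obj_score, mask_idx, mask_preds, mem_enc, obj_ptr = sammodel.step_video_masking(
        encoded_imgs_list, **obj_mem.to_dict()
    )
    if obj_score > 0:
        obj_mem.store_result(frame_idx, mem_enc, obj_ptr)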
Sure, thank you. But if I pass an empty deque as obj_ptr, it throws an error. I referred to the original SAMv2 repo and created a torch.zeros tensor as obj_ptr inside the add_mask function. Is that the right way to do it? Does it affect the result?
An empty list/deque shouldn't cause any problems (that's how the video_segmentation_from_mask.py script already works), so maybe there's some other error occurring?
Yes, that should work, though it may slightly degrade the tracking, since it's now looking for something that would have produced an object pointer of all zeros, which probably isn't true for any real objects. However, the tracking isn't very sensitive to the object pointers anyway, so it's probably fine.
Thank you so much, it works! One more small doubt: is there any way we can utilize the features and the predicted mask to classify the predicted mask? Or is classification happening anywhere internally, like mapping masks to certain labels (0, 1, and so on)?
The segment-anything models are a bit unusual in that they don't use a classifier token internally. That seems to be where the model name comes from, actually (they aren't limited to segmenting based on trained labels, so they can segment 'anything'). So the SAM models on their own are likely very poorly suited to classification.

It might be possible to make a rough classifier based on something like the semantic similarity experiment script. The idea would be to pre-compute image tokens for a bunch of example classes you want to label, then for some new masked image, compute the similarity with each of the pre-computed tokens; whichever class scores highest is the classification prediction.

Alternatively, the segmentation mask could be used to crop out the part of the image containing the object, and then an image classifier (like YOLO) could be run on the cropped image. This approach would likely be easier to implement, especially if you can find an image classifier that's already trained on the classes you're trying to label.
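As a small sketch of the crop-then-classify idea (the mask format and the run_classifier function are placeholder assumptions, not part of this repo; any pre-trained image classifier could stand in for it):

import cv2
import numpy as np

# Assumes 'full_image_bgr' is the original image and 'mask_uint8' is a 0/255 uint8
# mask of the same height/width, e.g. derived from the predicted mask
ys, xs = np.nonzero(mask_uint8)
if xs.size > 0:
    x1, x2 = xs.min(), xs.max()
    y1, y2 = ys.min(), ys.max()

    # Optionally blank out background pixels so the classifier only sees the object
    masked_image = cv2.bitwise_and(full_image_bgr, full_image_bgr, mask=mask_uint8)
    crop = masked_image[y1:y2 + 1, x1:x2 + 1]

    # 'run_classifier' is a stand-in for whatever classifier you choose to use
    predicted_label = run_classifier(crop)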
Thank you for the suggestion. I'm getting false positive segments; I'm not sure whether it's because the features are similar or something else is the issue. Is there a way we can get a confidence score, or any other way to mitigate the false positive segments?
The IoU output is intended to act as a confidence score. However, for SAMv2, I've found that the stability score sometimes works better as a predictor of mask quality.
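For reference, the stability score is typically computed from the raw (pre-threshold) mask logits, along the lines of the sketch below. The threshold and offset values here are the usual SAM defaults, which may not exactly match what this repo uses:

import torch

def stability_score(mask_logits, threshold=0.0, offset=1.0):
    # Ratio of the mask area at a high vs. low logit threshold; values near 1.0
    # mean the mask barely changes as the threshold shifts (a 'stable' prediction)
    high_area = (mask_logits > (threshold + offset)).sum(dtype=torch.float32)
    low_area = (mask_logits > (threshold - offset)).sum(dtype=torch.float32)
    return (high_area / torch.clamp(low_area, min=1)).item()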
As for mitigating things, one option is to adjust the prompts. If you're using masks, that would mean editing the masks themselves; otherwise you could also try using points or bounding boxes as prompts (and maybe negative points to exclude unwanted segments). If you're working with video and false positives show up over time, then it may help to provide additional prompts later on to 'reset' to a more accurate segmentation.

There are also other versions of SAM (like sam-hq; the v1 version of sam-hq is supported in this repo, under a different branch) which seem to do a lot better with fine details, in case that's related to the false positives you're getting. Otherwise, having a model that is fine-tuned to the data you're working with is probably the best way to consistently improve results. The SAMv2 repo now has a training script to help with doing this, though it is probably a lot of work.
How can I use sam-hq with a similar setup, where the prompt is a mask, to infer on some images? Also, is there any way to visualize attention from SAM2? Actually, the problem I'm facing is this: say my prompt image contains a tiny product such as a screw, along with the corresponding mask. In the video there are places where the screw isn't even there, but it still segments that region as a screw. Can you please help with that?
Since only the v1 model is supported for now, it can't be used in the same way as video masking with the v2 models. You can at least use the built-in mask prompting of the SAMv1 (or v2) models on single images, for example by running the run_image.py script and providing the mask as a prompt input.
The v2 models implement attention using the optimized scaled_dot_product_attention function, which makes it impossible to access the internal attention softmax results for visualization as-is. It is possible to implement the attention calculation more directly to get access to the softmax outputs, which I might do at some point in the future. It'll be a while until I can get to that, so if you wanted to make the change yourself, it would require first adding a softmax to the init of the attention class:

self.softmax = nn.Softmax(dim=-1)

And then replacing the existing scaled_dot_product_attention step with something like:

q = q * (self.features_per_head ** -0.5)
attn = q @ k.transpose(-2, -1)
value_weighting = self.softmax(attn)
output = (value_weighting @ v).transpose(1, 2).reshape(B, N, C)

Technically these changes would also need to be done to the pooled attention class as well, though those results may be quite hard to visualize. With the softmax layers in place, the attention results can then be captured during the image encoding step:

captures = ModelOutputCapture(sammodel, target_modules=torch.nn.Softmax)
encoded_img, _, _ = sammodel.encode_image(full_image_bgr, **imgenc_config_dict)
# -> Attention results will be inside the 'captures' list
One thing to check is to make sure that the mask prompt is only being given once, at the start of tracking. If the prompt gets repeated, it could cause the model to be heavily biased towards segmenting that same region. The other thing to look out for is that the prompt should be close to wherever the object is when the tracking starts, since SAMv2 is trained more for tracking than detection. If the object starts far away from the prompt, it may not track properly.
I didn't understand this case. For example, I have an image and a prompt mask of a screw. When I create memory from it and use that on video frames, in the initial frames there is no screw present in that region, but it still segments those places.
The SAMv2 model is trained for tracking rather than detection, so there's a built-in assumption that the first frame after the initial prompt occurs immediately after the prompt frame in time, meaning the object is likely to be very close (in position) to the initial prompt position.

Here's an example (using the cross-image segmentation experiment script) showing the behavior of the model, where the top image is used as a prompt for the bottom image (as if the bottom image were the first frame of a video). You can see how the model tends to segment things based on a combination of closeness in appearance as well as position in the image. For example, a prompt over the eye of the right-most bird segments the ears of the right sheep, the left-most bird segments the left sheep, and the black butterfly pattern at the top of the bird image segments the darker shapes at the top of the sheep image.

So the model isn't looking to 'detect' the prompted object in other frames (i.e. it's not looking for birds in the sheep image). It's more like it assumes the same object is in the next frame and is close by, but may have changed appearance. So a bird in one part of an image can be used to segment a sheep in the same region, even though they don't look that similar. The mismatch in appearance does tend to show up as a low object score though (the object score is 10 when matching the bird images to themselves, but around 1 when matching birds to sheep).

For the example you gave, it's probably the case that there's some distinct shape in the same region as the 'screw prompt', so it's as if the model thinks the screw 'changed appearance' from one frame (the prompt) to the next (the initial frame of the video) and segments the object in that region.

If you're trying to detect an object based on an image (and not be affected by position), the DINOv model might be a better match (or the related t-rex model), since it doesn't have the same tracking assumptions.
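Given the object-score behavior described above, one practical mitigation is to gate the per-frame results on a higher object score before trusting the mask or storing new memories. A sketch, where the threshold value is just a guess that would need tuning for your data (variable names follow the earlier snippets):

# Hypothetical object-score gate, applied inside the video loop
OBJSCORE_THRESHOLD = 2.0  # tuning guess, based on the ~1 vs ~10 scores mentioned above

obj_score, mask_idx, mask_preds, mem_enc, obj_ptr = sammodel.step_video_masking(
    encoded_imgs_list, **obj_mem.to_dict()
)
is_object_present = obj_score > OBJSCORE_THRESHOLD
if is_object_present:
    obj_mem.store_result(frame_idx, mem_enc, obj_ptr)
# When the score is low, skip storing memories and treat the mask as empty for this frame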
Can I give a boolean mask of an object as input and expect SAM2 to detect that object across videos? If so, can you give boilerplate code to do that?