- Abu Dhabi, UAE
- https://www.muhammadmaaz.com
Stars
A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Official implementation of the paper "GroupMamba: Parameter-Efficient and Accurate Group Visual State Space Model"
Official repository of the paper "VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding"
🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)
[ECCV 2024🔥] Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners"
[WACV 2025] Efficient Video Object Segmentation via Modulated Cross-Attention Memory
MobiLlama: Small Language Model tailored for edge devices
[WACV 2025] Vision-language conversation in 10 languages, including English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi, Bengali, and Urdu.
VLM Evaluation: benchmark for VLMs, spanning text-generation tasks from VQA to captioning
Code for MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
[CVPR 2024 🔥] GeoChat, the first grounded Large Vision Language Model for Remote Sensing
PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
[NeurIPS 2023] Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
[ICLR 2024] Official code for the paper "LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts"
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
This repository provides a comprehensive collection of research papers on multimodal representation learning, all of which are cited and discussed in the recently accepted survey: https://…
[MICCAI 2023] Official code repository of the paper "Frequency Domain Adversarial Training for Robust Volumetric Medical Segmentation"
[ICCV'23 Main Track, WECIA'23 Oral] Official repository of the paper "Self-regulating Prompts: Foundational Model Adaptation without Forgetting"
[EMNLP'23] ClimateGPT: a specialized LLM for conversations related to Climate Change and Sustainability topics, in both English and Arabic.
[BIONLP@ACL 2024] XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models.
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation.
Open-sourced code for MiniGPT-4 and MiniGPT-v2 (https://minigpt-4.github.io, https://minigpt-v2.github.io/)
Official implementation of the paper "Prompt Pre-Training with Over Twenty-Thousand Classes for Open-Vocabulary Visual Recognition"