AnyModal is a modular and extensible framework for integrating diverse input modalities (e.g., images, audio) into large language models (LLMs). It enables seamless tokenization, encoding, and language generation using pre-trained models for various modalities.
- Flexible Integration: Easily plug in different input modalities like vision, audio, and structured data.
- Tokenization Support: Tokenizes inputs from non-text modalities and combines them with LLMs for generation.
- Extensible Design: Add new input processors and tokenizers with minimal code changes.
The best way to get started with AnyModal is to have a read-through of the steps below and then see the examples provided in the `demos` directory. Also, check out the `anymodal.py` file to understand the core components of the framework.
To use AnyModal in your project, follow these steps:
- Copy `anymodal.py`: Copy the `anymodal.py` file into your project directory.
- Install Dependencies: Ensure the following dependencies are installed:

  ```
  pip install torch transformers datasets torchvision tqdm
  ```

  You may also need to install additional dependencies based on your use case.
AnyModal requires three core components for input processing:
- Input Processor: Processes raw input data into a format compatible with the encoder.
- Input Encoder: Encodes the processed data into feature representations.
- Input Tokenizer: Projects the encoded features into a token embedding space.
Example of integrating an image modality using a Vision Transformer (ViT):
```python
from transformers import ViTImageProcessor, ViTForImageClassification
from anymodal import MultiModalModel
from vision import VisionEncoder, Projector

# Load vision processor and model
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
vision_model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
hidden_size = vision_model.config.hidden_size

# Initialize vision encoder and projector
vision_encoder = VisionEncoder(vision_model)
vision_tokenizer = Projector(in_features=hidden_size, out_features=768)

# Load LLM components
from transformers import AutoTokenizer, AutoModelForCausalLM

llm_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
llm_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# Initialize AnyModal
multimodal_model = MultiModalModel(
    input_processor=None,
    input_encoder=vision_encoder,
    input_tokenizer=vision_tokenizer,
    language_tokenizer=llm_tokenizer,
    language_model=llm_model,
    input_start_token='<|imstart|>',
    input_end_token='<|imend|>',
    prompt_text="The interpretation of the given image is: "
)
```
Use AnyModal to train and generate predictions:
```python
# Training (assumes num_epochs, optimizer, and train_loader are already defined)
for epoch in range(num_epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        logits, loss = multimodal_model(batch)
        loss.backward()
        optimizer.step()

# Inference
sample_input = val_dataset[0]['input']
generated_text = multimodal_model.generate(sample_input, max_new_tokens=30)
print(generated_text)
```
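The training loop above assumes an optimizer and data loader have already been created. A minimal sketch of that setup, using hypothetical names (`train_dataset`, batch size, learning rate are placeholders, not part of AnyModal itself), might look like this:

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical setup for the variables used in the training loop above.
# Assumes MultiModalModel exposes parameters() like a standard torch.nn.Module.
num_epochs = 5
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(multimodal_model.parameters(), lr=1e-4)
```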
You can extend AnyModal by implementing new input processors and tokenizers. For example:
```python
class AudioProcessor:
    def __init__(self, sample_rate):
        self.sample_rate = sample_rate

    def process(self, audio_data):
        # Your audio preprocessing logic
        pass
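```

As a slightly fuller, purely illustrative sketch, a processor could pad or truncate raw waveforms to a fixed length before encoding. The class name, the `process` method, and the fixed-length convention here are assumptions for illustration, not part of the AnyModal API:

```python
import torch

class FixedLengthAudioProcessor:
    """Hypothetical processor: pads/truncates mono waveforms to a fixed duration."""
    def __init__(self, sample_rate=16000, max_seconds=10):
        self.sample_rate = sample_rate
        self.max_samples = sample_rate * max_seconds

    def process(self, audio_data):
        # audio_data: 1-D sequence of raw samples (assumed already at self.sample_rate)
        waveform = torch.as_tensor(audio_data, dtype=torch.float32)
        if waveform.numel() > self.max_samples:
            waveform = waveform[:self.max_samples]  # truncate long clips
        else:
            pad = self.max_samples - waveform.numel()
            waveform = torch.nn.functional.pad(waveform, (0, pad))  # zero-pad short clips
        return waveform
```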
Furthermore, you can modify the core components of AnyModal to suit your needs. Consider implementing functionality such as saving and loading models, or pushing trained projectors and LoRA adapters to the Hugging Face Hub.
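For example, if the projector is a regular PyTorch module (an assumption based on the `Projector` class used above), its weights can be checkpointed with standard `torch.save`/`load_state_dict` calls; the file name here is arbitrary:

```python
import torch

# Save the trained projector weights (assumes Projector is a torch.nn.Module)
torch.save(vision_tokenizer.state_dict(), "vision_projector.pt")

# Later: re-create the projector and restore the weights
vision_tokenizer = Projector(in_features=hidden_size, out_features=768)
vision_tokenizer.load_state_dict(torch.load("vision_projector.pt"))
```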
- AnyModal demo for LaTeX OCR
- AnyModal demo for Radiology Captioning
- AnyModal demo for Image Captioning
- AnyModal demo for Visual Question Answering
- AnyModal demo for Audio Captioning (audio + textual instructions)
Note that the demos are still in progress and there is room for improvement. Do you have ideas for other AnyModal demos? Feel free to suggest them!
Contributions are highly welcome! Whether it's fixing bugs, improving documentation, or adding support for new input modalities, your help is appreciated. Here's how you can contribute:
- Fork the repository and clone it to your local machine.
- Create a new branch for your feature or bug fix.
- Submit a pull request describing your changes.
Let's build and improve AnyModal together!
Join our subreddit at r/AnyModal to discuss ideas, ask questions, and share your projects using AnyModal.
This project is licensed under the MIT License. Feel free to use and modify as needed.
Happy building with AnyModal!