EfficientARV

This project develops an efficient autoregressive model for joint image and video generation. The aim is to strengthen the generative capabilities of current multimodal large language models (MLLMs) and, ultimately, to build an interactive world model that integrates multimodal understanding and generation.

🔆 New Features/Updates

Trained the first version on a relatively small dataset

TODO list

Implement the training and testing pipeline
Test different AR generation schemes
Jointly train the image and video generation model on a larger dataset at higher resolutions
Implement an efficient unstructured image-video tokenizer
Integrate generation capabilities into MLLMs
Support multiple conditional generation tasks: image animation, image/video inpainting and outpainting, video prediction, and video interpolation
Support multimodal controllable generation

Gallery


"time lapse of a cloudy sky"	"countryside top view"	"a blue and cloudy sky"	"aerial view of brown dry landscape"


"waterfalls in between mountain"	"view of the amazon river"	"a river waterfall cascading down the plunge basin"	"flooded landscape with palm trees"


"drone shot of an abandoned coliseum on a snowy mountain top"	"clouds over mountain"	"aerial view of road in forest"	"a peaceful lake"

Training

Coming soon

Inference

Coming soon

Everlyn-Labs/EfficientARV
