🌴 PALM-E: A Multi-Modal AI Model

This is an open source implementation of the state-of-the-art multimodal foundation model described in "PaLM-E: An Embodied Multimodal Language Model" from Google. PaLM-E is a single large embodied multimodal model that can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments. It also exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains.


Appreciation

  • All the creators in Agora. Join Agora, the community of AI engineers changing the world with their creations.
  • LucidRains, for inspiring me to devote myself to open source AI.

🚀 Quick Start

Installation 📦

pip install palme

Usage 🎨

import torch
from palme.model import PalmE

# random image tensor: (batch, channels, height, width)
img = torch.randn(1, 3, 256, 256)

# random token ids from a 20,000-token vocabulary: (batch, sequence length)
text = torch.randint(0, 20000, (1, 1024))

model = PalmE()
output = model(text, img)

Model Architecture

At a glance, the data flow is:

  • Vision input -> ViT encoder -> embedding -> LLM decoder
  • Language input -> token embedding -> LLM decoder

PaLM-E uses a pre-trained language model to process sensor data and generate text.

It converts sensor data, such as images, into a representation similar to how words are processed in a language model.

Language models represent text mathematically by dividing it into tokens, which are associated with high-dimensional vectors.

The model uses mathematical operations on these vectors to predict the next word token.

PaLM-E takes inputs in the form of "multimodal sentences," which can include text and other modalities like images or robot states.

From these inputs it generates text output, which can take the form of answers to questions or sequences of decisions.
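To make the idea of a "multimodal sentence" concrete, here is a minimal sketch of projecting image features into the language model's token-embedding space and interleaving them with text-token embeddings. The class name, the 512-dimensional embedding size, the 768-dimensional ViT feature size, and the choice to prepend image tokens are all illustrative assumptions, not the exact layout used in this repo or in the paper:

import torch
import torch.nn as nn

class MultimodalSentence(nn.Module):
    """Sketch: map image features into the LLM's token-embedding space and
    concatenate them with text-token embeddings (names are illustrative)."""

    def __init__(self, vocab_size=20000, dim=512, vit_dim=768):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)  # text tokens -> vectors
        self.image_proj = nn.Linear(vit_dim, dim)         # ViT features -> same vector space

    def forward(self, vit_features, text_ids):
        # vit_features: (batch, num_image_tokens, vit_dim) from a ViT encoder
        # text_ids:     (batch, seq_len) integer token ids
        img_embeds = self.image_proj(vit_features)        # (batch, num_image_tokens, dim)
        txt_embeds = self.token_embed(text_ids)           # (batch, seq_len, dim)
        # Prepend the image "tokens" to the text tokens; a decoder-only LLM
        # then consumes this multimodal sentence to predict the next token.
        return torch.cat([img_embeds, txt_embeds], dim=1)

# usage
sentence = MultimodalSentence()(torch.randn(1, 16, 768), torch.randint(0, 20000, (1, 64)))
print(sentence.shape)  # torch.Size([1, 80, 512])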


Dataset Strategy

Here is a summary table of the key datasets mentioned in the paper:

| Dataset             | Tasks                                                   | Size                            | Link                    |
|---------------------|---------------------------------------------------------|---------------------------------|-------------------------|
| TAMP                | Robotic manipulation planning, VQA                      | 96,000 scenes                   | Custom dataset          |
| Language Table      | Robotic manipulation planning                           | Custom dataset                  | Link                    |
| Mobile Manipulation | Robotic navigation and manipulation planning, VQA       | 2,912 sequences                 | Based on SayCan dataset |
| WebLI               | Image-text retrieval                                    | 66M image-caption pairs         | Link                    |
| VQAv2               | Visual question answering                               | 1.1M questions on COCO images   | Link                    |
| OK-VQA              | Visual question answering requiring external knowledge  | 14,031 questions on COCO images | Link                    |
| COCO                | Image captioning                                        | 330K images with captions       | Link                    |
| Wikipedia           | Text corpus                                             | N/A                             | Link                    |

The key robotics datasets were collected specifically for this work, while the larger vision-language datasets (WebLI, VQAv2, OK-VQA, COCO) are standard benchmarks in that field. The datasets range from tens of thousands of examples for the robotics domains to tens of millions for the internet-scale vision-language data.
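If you want to experiment with joint training over such a mixture, one simple approach is weighted sampling across the datasets. The sketch below is only illustrative: the dataset names, the mixture weights, and the make_loader helper are hypothetical stand-ins, not part of this repo or the paper's recipe:

import random
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical stand-ins for the real datasets; each yields (text, image) batches.
def make_loader(num_examples, batch_size=4):
    text = torch.randint(0, 20000, (num_examples, 1024))
    imgs = torch.randn(num_examples, 3, 256, 256)
    return DataLoader(TensorDataset(text, imgs), batch_size=batch_size, shuffle=True)

loaders = {
    "robotics": make_loader(64),   # small, domain-specific data
    "webli":    make_loader(256),  # large, internet-scale vision-language data
    "vqa":      make_loader(128),
}
weights = {"robotics": 0.3, "webli": 0.5, "vqa": 0.2}  # illustrative mixture weights

iters = {name: iter(dl) for name, dl in loaders.items()}
for step in range(10):
    # Pick a dataset according to the mixture weights, then draw one batch from it.
    name = random.choices(list(weights), weights=list(weights.values()))[0]
    try:
        text, img = next(iters[name])
    except StopIteration:
        iters[name] = iter(loaders[name])
        text, img = next(iters[name])
    # A forward pass would go here, e.g. output = model(text, img)
    print(step, name, text.shape, img.shape)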


Contribute || Be Part of the PALM-E Adventure 🤝

Your brilliance is needed! Join us, and together, let's make PALM-E even more awe-inspiring:

  1. Get Your Copy: Fork the PALM-E repo.
  2. Make It Local: Clone your fork.
  3. Prep Your Tools: Install the necessities.
  4. Discover & Innovate: Dive into the code.
  5. Craft Your Magic: Branch and code away.
  6. Show & Tell: Push your changes and craft a pull request.

🐞 Fixes, 🎨 enhancements, 📝 docs, or 💡 ideas – all are welcome! Let's shape the future of AI, hand in hand.


Roadmap

  • URGENT: Debug Tokenizer, make sure multi-modal inputs work.
  • Create Dataset Strategy
  • Upload Training Documentation
  • Get Training running with multi-modal

Citation

@article{driess2023palme,
  title={PaLM-E: An Embodied Multimodal Language Model},
  author={Driess, Danny and Xia, Fei and Sajjadi, Mehdi S. M. and Lynch, Corey and Chowdhery, Aakanksha and Ichter, Brian and Wahid, Ayzaan and Tompson, Jonathan and Vuong, Quan and Yu, Tianhe and Huang, Wenlong and Chebotar, Yevgen and Sermanet, Pierre and Duckworth, Daniel and Levine, Sergey and Vanhoucke, Vincent and Hausman, Karol and Toussaint, Marc and Greff, Klaus and Zeng, Andy and Mordatch, Igor and Florence, Pete},
  journal={arXiv preprint arXiv:2303.03378},
  year={2023},
  url={https://doi.org/10.48550/arXiv.2303.03378}
}
