Simple and multifaceted API for AI
- Python APIs for large language models, text-to-speech, and Stable Diffusion that share a consistent interface across your projects
- HTML/JavaScript chat interface with image generation and PDF reading abilities, code blocks, chat history, and more
- Gradio interface for experimenting with various features
- Windows or Linux (WSL is supported, and even recommended)
- 12GB VRAM (RTX 3060 or RTX 4060 Ti recommended)
- 32GB RAM (64GB recommended)
- Python 3.10 (may work on 3.11; file an issue if you run into problems)
- CUDA 12.3 Toolkit
Your mileage may vary. If you have plenty of system RAM, many features will still work, though slowly and/or at lower resolution.
- Large language models using ExLlamaV2 (Llama 3.1 8B by default; other options available)
- Vision: YOLOS, Moondream, Owl, LLaVA, DepthAnything, Midas, Canny, and more
- Speech dictation using Whisper
- Image generation: SD1.5, SDXL, SD3, Turbo, Lightning, Cascade, IC Relight, Flux, and more
- Video: Stable Video Diffusion XT, LivePortrait, AnimateLCM with multiple modes available
- Audio: MusicGen, AudioGen, MMAudio
- Text-to-speech: XTTS with instant voice cloning from 6-20 second samples; an Edge TTS API is also included
- Canny and depth detection with text-to-image IP adapter support
- 3D model generation: Shap-E, TripoSR, LGM Mini
- Endpoints with combinations of features to automate workflow
- Easy plugin system that Copilot understands (write plugins for new HF models in minutes; see the sketch below) ... and much more!
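The real plugin interface isn't reproduced here, so the following is only a hypothetical sketch of the general shape such a plugin could take: a thin wrapper that lazily loads a Hugging Face pipeline and exposes one generation method. The class name, route name, and load/generate methods are all invented for illustration; only the transformers call is real.

```python
from transformers import pipeline


class CaptionPlugin:
    # In the real system this would plug into the project's plugin base class;
    # the route name below is invented for illustration.
    name = "img2txt/caption"

    def __init__(self):
        self._pipe = None  # defer the model download/load until first use

    def load(self):
        # pipeline() is the standard Hugging Face entry point; the checkpoint
        # lands in the shared HF cache described below
        if self._pipe is None:
            self._pipe = pipeline(
                "image-to-text", model="Salesforce/blip-image-captioning-base"
            )
        return self._pipe

    def generate(self, image_path: str) -> str:
        # Run captioning and return the first generated string
        return self.load()(image_path)[0]["generated_text"]
```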
Yes! Models and other resources are downloaded automatically. This project aims to fully utilize the Hugging Face cache system.
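Since downloads go through the standard Hugging Face cache, the usual huggingface_hub environment variables and tools apply; nothing here is project-specific. For example, you can relocate the cache to a larger drive and inspect what has already been downloaded:

```python
import os

# Point the Hugging Face cache at a larger drive *before* anything loads a model.
# HF_HOME is the standard huggingface_hub environment variable.
os.environ["HF_HOME"] = "D:/hf-cache"

from huggingface_hub import scan_cache_dir

# Report what is already cached and how much disk it occupies
info = scan_cache_dir()
print(f"{len(info.repos)} cached repos, {info.size_on_disk / 1e9:.1f} GB on disk")
```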
I just wanted a unified Python API for LLMs and TTS, and possibly even for generating simple images. Too many projects require complicated setups, Docker, etc. Many have also become stale or obsolete as Hugging Face has generously provided improved APIs and examples. Mainly, I wanted something simple enough to modify for my exact needs in any scenario without a huge learning curve. I tried to leave everything accessible enough for you to do the same.
- Do what I personally need for my projects (I hope it serves you too!)
- No complicated installation steps
- Something ready to use, fine-tuned right out of the box
(Note: Some of this is temporary until I decide on a proper way of handling settings.)
A working run.bat is included for reference, but feel free to use your environment of choice (conda, WSL, etc).
The following API endpoints are available (please note that this is not a complete list, as new features are added constantly); a usage sketch follows the list:
/img/canny
/img/depth
/img/depth/midas
/img/rembg
/vid2densepose
/txt2img
/img2img
/inpaint
/txt2img/flux
/txt2img/canny
/txt2img/depth
/txt2img/openpose
/txt2img/relight
/txt2img/instantid
/txt2img/cascade
/txt2img/controlnet
/txt2model/shape
/img2model/lgm
/img2model/tsr
/img2vid/xt
/txt2vid/animate
/txt2vid/zero
/txt2vid/zeroscope
/img2vid/liveportrait
/detect/yolos
/vision
/img2txt/llava
/txt2wav/musicgen
/mmaudio
/piano2midi
/chat/completions
/chat/stream
/txt/summary
/txt/profile
/youtube/download
/youtube/captions
/youtube/grid
/youtube/frames
/reddit/download
/tts
/google/trends
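For illustration, here is a minimal client call against a local instance. The host/port, the parameter names, and the response format are assumptions rather than documented behavior, so check your running server's API docs for the real signatures:

```python
import requests

# Assumed local address -- adjust to wherever your server is running
BASE = "http://127.0.0.1:5000"

# "prompt" is a hypothetical parameter name for illustration
response = requests.get(
    f"{BASE}/txt2img",
    params={"prompt": "a lighthouse at dusk, photorealistic"},
    timeout=300,
)
response.raise_for_status()

# Assuming the endpoint returns the finished image bytes directly
with open("lighthouse.png", "wb") as f:
    f.write(response.content)
```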
Add wav files containing samples of the voices you want to use into the voices/ folder. A single example, female1.wav, is included. The voice parameter of the tts API expects the name of the file (without .wav on the end). There is no training required!
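As a concrete sketch, a /tts request using the bundled female1.wav might look like the following. The voice parameter is documented above; the text parameter and the returned audio format are assumptions, so verify them against your server's API docs:

```python
import requests

# "voice" is the filename in voices/ without the .wav extension (documented above);
# "text" and the wav response format are assumptions for illustration.
resp = requests.get(
    "http://127.0.0.1:5000/tts",
    params={"text": "Hello from the speech API!", "voice": "female1"},
    timeout=120,
)
resp.raise_for_status()

with open("hello.wav", "wb") as f:
    f.write(resp.content)
```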