The pipe is a multimodal-first tool for feeding real-world information into large language models. It is built on top of dozens of carefully-crafted heuristics to create sensible text and image prompts from files, directories, web pages, papers, github repos, etc.
- Prepare prompts from dozens of complex file types π
- Visual document extraction for complex PDFs, markdown, etc π§
- Outputs optimized for multimodal LLMs πΌοΈ + π¬
- Multi-threaded β‘οΈ
- Works with missing file extensions, in-memory data streams πΎ
- Works with directories, URL, git repos, and more π
The pipe is accessible from the command line or from Python. The input source is either a file path, a URL, or a directory (or zip file) path. The pipe will extract information from the source and process it for downstream use with language models, vision transformers, or vision-language models. The output from the pipe is a sensible text-based (or multimodal) representation of the extracted information, carefully crafted to fit within context windows for any models from gemma-7b to GPT-4. It uses a variety of heuristics for optimal performance with vision-language models, including AI filetype detection with filetype detection, AI PDF extraction, efficient token compression, automatic image encoding, reranking for lost-in-the-middle effects, and more, all pre-built to work out-of-the-box.
To use The Pipe, clone the repository and install the requirements:
git clone https://github.com/emcf/thepipe
pip install -r requirements.txt
npm install
npx playwright install --with-deps
Linux users can install ctags with
sudo apt-get install -y universal-ctags
Windows users must ensure ctags.exe is in their PATH environment variable.
To use The Pipe from the command line, simply run
python thepipe.py path/to/directory
This command will process all supported files within the specified directory, compressing any information over the token limit if necessary, and outputting the resulting prompt and images to a folder.
Arguments are:
- The input source (required): can be a file path, a URL, or a directory path.
--match
(optional): Regex pattern to match files in the directory.--ignore
(optional): Regex pattern to ignore files in the directory.--limit
(optional): The token limit for the output prompt, defaults to 100K. Prompts exceeding the limit will be compressed.--mathpix
(optional): Extract images, tables, and math from PDFs using Mathpix.--text_only
(optional): Do not extract images from documents or websites. Additionally, image files will be represented with OCR instead of as images.
To use the pipe from Python with a language model, simply run
import openai
import thepipe
openai_client = openai.OpenAI()
response = openai_client.chat.completions.create(
model="gpt-4-vision-preview",
messages = thepipe.make_prompt_from_source("https://github.com/emcf/thepipe"),
)
You can use the pipe's output with other LLM providers via LiteLLM.
Source Type | Input types | Token Compression ποΈ | Image Extraction ποΈ | Notes π |
---|---|---|---|---|
Directory | Any /path/to/directory |
βοΈ | βοΈ | Extracts from all files in directory, supports match and ignore patterns |
Code | .py , .tsx , .js , .html , .css , .cpp , etc |
βοΈ (varies) | β | Combines all code files. .c , .cpp , .py are compressible with ctags, others are not |
Plaintext | .txt , .md , .rtf , etc |
βοΈ | β | Regular text files |
.pdf |
βοΈ | βοΈ | Extracts text and optionally images; can use Mathpix for enhanced extraction | |
Image | .jpg , .jpeg , .png , .gif , .bmp , .tiff , .webp , .svg |
β | βοΈ | Extracts images and can convert to text using OCR |
Data Table | .csv , .xls , .xlsx , supabase |
βοΈ | β | Extracts data from spreadsheets or SQL tables; converts to text representation. For very large datasets, will only extract column names and types |
Jupyter Notebook | .ipynb |
β | β | Extracts content from Jupyter notebooks |
Microsoft Word Document | .docx |
βοΈ | βοΈ | Extracts text from Word documents |
Microsoft PowerPoint Presentation | .pptx |
βοΈ | βοΈ | Extracts text from PowerPoint presentations |
Website | URLs (http, https, www, ftp) | βοΈ | βοΈ | Extracts content from web pages; text-only extraction available |
GitHub Repository | GitHub repo URLs | βοΈ | βοΈ | Extracts from GitHub repositories; supports branch specification |
ZIP File | .zip |
βοΈ | βοΈ | Extracts contents of ZIP files; supports nested directory extraction |