Skip to content
forked from emcf/thepipe

Feed real-world data into large language models 🚰🧠

Notifications You must be signed in to change notification settings

ibehnam/_thepipe

Β 
Β 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

61 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

codecov python-gh-action

The pipe is a multimodal-first tool for feeding real-world information into large language models. It is built on top of dozens of carefully-crafted heuristics to create sensible text and image prompts from files, directories, web pages, papers, github repos, etc.

Demo

Features 🌟

  • Prepare prompts from dozens of complex file types πŸ“„
  • Visual document extraction for complex PDFs, markdown, etc 🧠
  • Outputs optimized for multimodal LLMs πŸ–ΌοΈ + πŸ’¬
  • Multi-threaded ⚑️
  • Works with missing file extensions, in-memory data streams πŸ’Ύ
  • Works with directories, URL, git repos, and more 🌐

How it works πŸ› οΈ

The pipe is accessible from the command line or from Python. The input source is either a file path, a URL, or a directory (or zip file) path. The pipe will extract information from the source and process it for downstream use with language models, vision transformers, or vision-language models. The output from the pipe is a sensible text-based (or multimodal) representation of the extracted information, carefully crafted to fit within context windows for any models from gemma-7b to GPT-4. It uses a variety of heuristics for optimal performance with vision-language models, including AI filetype detection with filetype detection, AI PDF extraction, efficient token compression, automatic image encoding, reranking for lost-in-the-middle effects, and more, all pre-built to work out-of-the-box.

Getting Started πŸš€

To use The Pipe, clone the repository and install the requirements:

git clone https://github.com/emcf/thepipe
pip install -r requirements.txt
npm install
npx playwright install --with-deps

Linux users can install ctags with

sudo apt-get install -y universal-ctags

Windows users must ensure ctags.exe is in their PATH environment variable.

To use The Pipe from the command line, simply run

python thepipe.py path/to/directory

This command will process all supported files within the specified directory, compressing any information over the token limit if necessary, and outputting the resulting prompt and images to a folder.

Arguments are:

  • The input source (required): can be a file path, a URL, or a directory path.
  • --match (optional): Regex pattern to match files in the directory.
  • --ignore (optional): Regex pattern to ignore files in the directory.
  • --limit (optional): The token limit for the output prompt, defaults to 100K. Prompts exceeding the limit will be compressed.
  • --mathpix (optional): Extract images, tables, and math from PDFs using Mathpix.
  • --text_only (optional): Do not extract images from documents or websites. Additionally, image files will be represented with OCR instead of as images.

To use the pipe from Python with a language model, simply run

import openai
import thepipe
openai_client = openai.OpenAI()
response = openai_client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages = thepipe.make_prompt_from_source("https://github.com/emcf/thepipe"),
)

You can use the pipe's output with other LLM providers via LiteLLM.

Supported File Types πŸ“š

Source Type Input types Token Compression πŸ—œοΈ Image Extraction πŸ‘οΈ Notes πŸ“Œ
Directory Any /path/to/directory βœ”οΈ βœ”οΈ Extracts from all files in directory, supports match and ignore patterns
Code .py, .tsx, .js, .html, .css, .cpp, etc βœ”οΈ (varies) ❌ Combines all code files. .c, .cpp, .py are compressible with ctags, others are not
Plaintext .txt, .md, .rtf, etc βœ”οΈ ❌ Regular text files
PDF .pdf βœ”οΈ βœ”οΈ Extracts text and optionally images; can use Mathpix for enhanced extraction
Image .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg ❌ βœ”οΈ Extracts images and can convert to text using OCR
Data Table .csv, .xls, .xlsx, supabase βœ”οΈ ❌ Extracts data from spreadsheets or SQL tables; converts to text representation. For very large datasets, will only extract column names and types
Jupyter Notebook .ipynb ❌ ❌ Extracts content from Jupyter notebooks
Microsoft Word Document .docx βœ”οΈ βœ”οΈ Extracts text from Word documents
Microsoft PowerPoint Presentation .pptx βœ”οΈ βœ”οΈ Extracts text from PowerPoint presentations
Website URLs (http, https, www, ftp) βœ”οΈ βœ”οΈ Extracts content from web pages; text-only extraction available
GitHub Repository GitHub repo URLs βœ”οΈ βœ”οΈ Extracts from GitHub repositories; supports branch specification
ZIP File .zip βœ”οΈ βœ”οΈ Extracts contents of ZIP files; supports nested directory extraction

About

Feed real-world data into large language models 🚰🧠

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 97.4%
  • TypeScript 1.3%
  • CSS 0.5%
  • C++ 0.3%
  • C 0.2%
  • Batchfile 0.2%
  • Jupyter Notebook 0.1%