duriantaco/pykomodo

A Python-based parallel file chunking system designed for processing large codebases into LLM-friendly chunks. The tool provides intelligent file filtering, multi-threaded processing, and advanced chunking capabilities optimized for machine learning contexts.

Core Features

  • Parallel Processing: Multi-threaded file reading with configurable thread pools

  • Smart File Filtering:

    • Built-in patterns for common excludes (.git, node_modules, __pycache__, etc.)
    • Customizable ignore/unignore patterns
    • Intelligent binary file detection
  • Flexible Chunking:

    • Equal-parts chunking: Split content into N equal chunks
    • Size-based chunking: Split by maximum chunk size
    • Semantic (AST-based) chunking for Python files
    • Dry-run mode: List which files would be chunked without writing any output
  • LLM Optimizations:

    • Metadata extraction (functions, classes, imports, docstrings)
    • Content relevance scoring
    • Redundancy removal across chunks
    • Configurable context window sizes
  • NEW Chunking PDF Files:

    • Split PDF content by pages and paragraphs (rather than lines)
    • Perform basic text cleanup to handle multi-column layouts, or text from HTML-like elements if present
    • Create multiple chunks for large PDFs while preserving some logical structure

Installation

pip install pykomodo

PyPI: https://pypi.org/project/pykomodo/

Quick Start

Command Line Usage

Here’s a complete list of available command-line options for the komodo tool:

| Option | Description | Default Value |
| --- | --- | --- |
| --version | Show the version of komodo | N/A |
| dirs | Directories to process (space-separated; e.g., komodo dir1/ dir2/). | Current directory (.) |
| --equal-chunks N | Split content into N equal chunks. Mutually exclusive with --max-chunk-size. | None |
| --max-chunk-size M | Maximum size per chunk (tokens without --semantic-chunks; lines for .py with it). | None |
| --output-dir DIR | Directory where chunk files are saved. | "chunks" |
| --ignore PATTERN | Add a pattern to ignore (repeatable, e.g., --ignore "*.log"). | None |
| --unignore PATTERN | Add a pattern to unignore (repeatable, overrides ignores). | None |
| --dry-run | List files that would be processed without creating chunks. | False |
| --priority PATTERN,SCORE | Set priority for file patterns (repeatable, e.g., --priority "*.py,10"). | None |
| --num-threads N | Number of threads for parallel processing. | 4 |
| --enhanced | Use EnhancedParallelChunker for LLM optimizations. | False |
| --semantic-chunks | Enable AST-based chunking for .py files (splits by functions/classes). | False |
| --context-window N | Target LLM context window size in bytes (used with --enhanced). | 4096 |
| --min-relevance F | Minimum relevance score for chunks (0.0-1.0, used with --enhanced). | 0.3 |
| --no-metadata | Disable metadata extraction (used with --enhanced). | False (metadata enabled) |
| --keep-redundant | Keep redundant content across chunks (used with --enhanced). | False (removes redundancy) |
| --no-summaries | Disable summary generation (used with --enhanced; currently a placeholder in code). | False (summaries enabled) |
| --file-type TYPE | Only process files of this extension (e.g., pdf, py). | None |

Notes:

  • Options like --equal-chunks and --max-chunk-size cannot be used together (enforced by the CLI).
  • Use --dry-run to test your ignore/unignore patterns or priority rules without generating output.
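The mutual-exclusion rule in the notes above can be illustrated with Python's standard argparse module. This is a minimal sketch, not Komodo's actual CLI code: the parser name `komodo-sketch` and the helper functions `build_parser` and `parse` are hypothetical, and only three of the documented options are mirrored.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser that mirrors a few of the documented options."""
    parser = argparse.ArgumentParser(prog="komodo-sketch")  # hypothetical name
    parser.add_argument("dirs", nargs="*", default=["."])
    # Exactly one of the two chunking options must be given.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--equal-chunks", type=int)
    group.add_argument("--max-chunk-size", type=int)
    parser.add_argument("--dry-run", action="store_true")
    return parser


def parse(argv):
    """Return parsed args, or None if argparse rejects the combination."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        # argparse prints a usage error and exits on invalid input
        return None
```

With this setup, passing both options, or neither, is rejected before any chunking work starts.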

Basic usage

# Split into 5 equal chunks
komodo . --equal-chunks 5

# Process multiple directories
komodo path1/ path2/ --max-chunk-size 1000

Chunking Modes

Komodo offers flexible chunking strategies, with behavior varying based on options and the chunker type (ParallelChunker or EnhancedParallelChunker with --enhanced).

  • Fixed Number of Chunks (--equal-chunks N):

    • Base Chunker: Keeps files whole, distributing them into N chunks with approximately equal total character counts. For example, --equal-chunks 5 produces 5 chunks, written as 5 text files.

      komodo . --equal-chunks 5 --output-dir chunks
    • Enhanced Chunker: Combines all file contents into one text blob, then splits into N chunks of roughly equal byte size, potentially splitting files mid-content.

      komodo . --equal-chunks 5 --enhanced
  • Fixed Size Chunks (--max-chunk-size M):

    • Without --semantic-chunks: Splits each file into chunks of at most M tokens (words), keeping lines whole. For example, M = 2000 yields as many chunks as needed, each with at most 2000 tokens.

      komodo . --max-chunk-size 2000
    • With --semantic-chunks:

      • For .py files: Aims for chunks of M lines, treating top-level functions and classes as atomic units. A function longer than M lines becomes a single chunk of its own.
      • For non-.py files: Still splits by M tokens.

      komodo . --max-chunk-size 200 --semantic-chunks

    Important: You must specify either --equal-chunks or --max-chunk-size, but not both.

  • PDF Chunking:

    Uses PyMuPDF to split PDFs by pages and paragraphs, respecting --max-chunk-size in tokens.

    komodo . --max-chunk-size 500 --output-dir /path/to/output --file-type pdf

    or

    komodo . --equal-chunks 10 --output-dir /path/to/output --file-type pdf

    IMPORTANT: The PDF chunker extracts text only. It cannot chunk formulas or images, so it will NOT work for PDFs that consist largely of images.
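The two --max-chunk-size behaviours described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not Komodo's implementation: `split_by_tokens` and `split_python_by_lines` are hypothetical names, "tokens" are taken to be whitespace-separated words, and blank lines or comments between top-level statements are dropped for brevity.

```python
import ast


def split_by_tokens(text: str, max_tokens: int) -> list[str]:
    """Group whole lines into chunks of at most max_tokens words.

    A single line with more than max_tokens words is kept whole in its own chunk.
    """
    chunks, current, count = [], [], 0
    for line in text.splitlines():
        n = len(line.split())  # assumption: a token is a whitespace-separated word
        if current and count + n > max_tokens:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(line)
        count += n
    if current:
        chunks.append("\n".join(current))
    return chunks


def split_python_by_lines(source: str, max_lines: int) -> list[str]:
    """Group top-level statements into chunks of roughly max_lines lines.

    Each top-level statement (function, class, import, ...) is atomic, so a
    function longer than max_lines ends up alone in one chunk.
    """
    lines = source.splitlines()
    chunks, current = [], []
    for node in ast.parse(source).body:
        # Include decorator lines, which start before node.lineno.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        block = lines[start - 1 : node.end_lineno]  # 1-based, inclusive end
        if current and len(current) + len(block) > max_lines:
            chunks.append("\n".join(current))
            current = []
        current.extend(block)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Both functions flush the current chunk before adding a unit that would overflow it, which is why an oversized line or function still lands in exactly one chunk rather than being cut.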

Ignoring & Unignoring Files

  • Add ignore patterns with --ignore.

  • Unignore specific patterns with --unignore.

  • Komodo also has built-in ignores like .git, __pycache__, node_modules, etc.

    # Skip everything in "results/" (relative) and "docs/" (relative)
    komodo . --equal-chunks 5 \
      --ignore "results/**" \
      --ignore "docs/**"
    
    # Skip an absolute path
    komodo . --equal-chunks 5 \
      --ignore "/Users/oha/komodo/results/**"
    
    # Skip all .rst files, but unignore README.rst
    komodo . --equal-chunks 5 \
      --ignore "*.rst" \
      --unignore "README.rst"

    Safest (Recursive) Ignoring

    If you want to ensure that Komodo skips all files inside a particular directory (including all subfolders), you can use the ** wildcard before and after the folder name:

    # safest mode: skip everything in "results/" and "docs/" recursively
    komodo . --equal-chunks 5 \
      --ignore "**/results/**" \
      --ignore "**/docs/**"

    Pro Tip: If in doubt, use **/folder/** to recursively ignore that folder and everything beneath it. This is the most reliable way to avoid processing unwanted files in subdirectories.

    Fixed Number of Chunks with ignore mode

    • --ignore "/Users/oha/treeline/results/**" tells the chunker to skip any files in that absolute directory path.

    • --ignore "docs/*" tells it to skip files directly inside a relative folder named docs/ (use docs/** to include subdirectories).

      komodo . --equal-chunks 5 --ignore "/Users/oha/treeline/results/**" --ignore "docs/*"

    Priority Rules

    Priority rules determine which files are processed first or given more importance. Files with higher priority scores are processed first.

    # With equal chunks: *.py files (score 10) are processed before *.md files (score 5)
    komodo . \
      --equal-chunks 5 \
      --priority "*.py,10" \
      --priority "*.md,5" \
      --output-dir chunks
    
    # Or with max chunk size
    komodo . \
      --max-chunk-size 1000 \
      --priority "*.py,10" \
      --priority "*.md,5" \
      --output-dir chunks
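The ordering that priority rules imply can be sketched as follows. This is an illustration only: `priority_of` and `sort_by_priority` are hypothetical helpers, and two details are assumptions rather than documented behaviour, namely that patterns are matched against the file name, and that unmatched files score 0 while the highest matching rule wins.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def priority_of(path: str, rules: list[tuple[str, int]]) -> int:
    """Return the highest score among rules whose pattern matches the file name."""
    name = PurePosixPath(path).name
    scores = [score for pattern, score in rules if fnmatch(name, pattern)]
    return max(scores, default=0)  # assumption: unmatched files score 0


def sort_by_priority(paths: list[str], rules: list[tuple[str, int]]) -> list[str]:
    """Order paths from highest to lowest priority; ties keep their input order."""
    return sorted(paths, key=lambda p: -priority_of(p, rules))
```

Because Python's sort is stable, files with equal scores stay in discovery order, which keeps the result deterministic for a given directory listing.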

LLM Optimization Options

Enable metadata extraction and content optimization:

komodo . \
  --equal-chunks 5 \
  --enhanced \
  --context-window 4096 \
  --min-relevance 0.3

komodo . \
  --equal-chunks 5 \
  --enhanced \
  --keep-redundant \
  --min-relevance 0.5

komodo . \
  --equal-chunks 5 \
  --enhanced \
  --no-metadata \
  --context-window 8192

New: Dry Run

If you only want to see which files would be chunked (and in what priority order) without writing any output chunks, specify --dry-run. This is especially helpful when testing new ignore/unignore patterns or priority rules. No chunking is performed in this mode; Komodo only lists the files that would be processed.

Example:

# Basic dry run
komodo . --equal-chunks 5 --dry-run

# With priorities: .py files are listed (and would be processed) first. This is still only a dry run.
komodo . --equal-chunks 5 --dry-run \
    --priority "*.py,10" \
    --priority "*.md,5"

No chunks are created. Komodo simply prints the would-be processed files, sorted by priority. This is an easy way to confirm your ignore patterns and see exactly which files the chunker will pick up.

Python API Usage

Basic usage:

from pykomodo.multi_dirs_chunker import ParallelChunker

# Split into 5 equal chunks
chunker = ParallelChunker(
    equal_chunks=5,
    output_dir="chunks"
)
chunker.process_directory("path/to/code")

Advanced configuration:

from pykomodo.multi_dirs_chunker import ParallelChunker

chunker = ParallelChunker(
    equal_chunks=5,  # or max_chunk_size=1000
    
    user_ignore=["*.log", "node_modules/**"],
    user_unignore=["important.log"],
    binary_extensions=["exe", "dll", "so", "bin"],
    
    priority_rules=[
        ("*.py", 10),
        ("*.md", 5),
        ("*.txt", 1)
    ],
    
    output_dir="chunks",
    num_threads=4
)

chunker.process_directories(["src/", "docs/", "tests/"])
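The binary_extensions option above suggests an extension-based filter. A rough sketch of how such a check might be combined with content sniffing is shown below. Everything here is an assumption for illustration: `has_binary_content` and `looks_binary` are hypothetical helpers, the null-byte heuristic is a common convention rather than Komodo's documented method, and treating unreadable files as binary is a design choice of this sketch.

```python
from pathlib import Path

# Same extensions as in the configuration example above
BINARY_EXTENSIONS = frozenset({"exe", "dll", "so", "bin"})


def has_binary_content(data: bytes) -> bool:
    """Heuristic: text files rarely contain NUL bytes."""
    return b"\0" in data


def looks_binary(path, binary_extensions=BINARY_EXTENSIONS, sniff_bytes=8192) -> bool:
    """Return True if the file should be skipped as binary."""
    path = Path(path)
    # Cheap check first: a known binary extension needs no file I/O.
    if path.suffix.lstrip(".").lower() in binary_extensions:
        return True
    try:
        with path.open("rb") as fh:
            head = fh.read(sniff_bytes)
    except OSError:
        return True  # sketch's choice: unreadable files are not chunkable
    return has_binary_content(head)
```

Checking the extension before opening the file keeps the common case fast, and reading only the first few kilobytes bounds the cost for large files.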

Basic configuration with file_type:

import os
from pykomodo.multi_dirs_chunker import ParallelChunker

output_dir = "/Users/test/komodo/pdf"
os.makedirs(output_dir, exist_ok=True)

chunker = ParallelChunker(
    max_chunk_size=1000,
    output_dir=output_dir,
    file_type="pdf" 
)

chunker.process_directory("/Users/test/komodo/")

print("PDF processing completed successfully!")

Advanced LLM Features

Metadata Extraction

Each chunk automatically extracts and includes:

  • Function definitions
  • Class declarations
  • Import statements
  • Docstrings
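For Python sources, the four kinds of metadata listed above can be collected with the standard ast module. This is a minimal sketch and not necessarily how Komodo extracts metadata; the function name `extract_metadata` and the shape of the returned dictionary are assumptions made for illustration.

```python
import ast


def extract_metadata(source: str) -> dict[str, list[str]]:
    """Collect function names, class names, imported modules, and docstrings."""
    tree = ast.parse(source)
    meta = {"functions": [], "classes": [], "imports": [], "docstrings": []}
    module_doc = ast.get_docstring(tree)
    if module_doc:
        meta["docstrings"].append(module_doc)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            meta["functions"].append(node.name)
        elif isinstance(node, ast.ClassDef):
            meta["classes"].append(node.name)
        elif isinstance(node, ast.Import):
            meta["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x"
            meta["imports"].append(node.module or ".")
        # Functions and classes may carry their own docstrings.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            doc = ast.get_docstring(node)
            if doc:
                meta["docstrings"].append(doc)
    return meta
```

Note that ast.walk visits nodes in no guaranteed order, so callers that need a stable order should sort the lists.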

Relevance Scoring

Chunks are scored based on:

  • Code/comment ratio
  • Function/class density
  • Documentation quality
  • Import significance
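The README does not give the scoring formula, so the following is only a hypothetical illustration of how such signals could be combined into a score between 0.0 and 1.0. The function name `relevance_score`, the line-prefix heuristics, and all weights are invented for this sketch; documentation quality is left out entirely.

```python
def relevance_score(chunk: str) -> float:
    """Toy relevance score in [0.0, 1.0]; weights are arbitrary assumptions."""
    lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    comments = sum(ln.startswith("#") for ln in lines)
    defs = sum(ln.startswith(("def ", "class ", "async def ")) for ln in lines)
    imports = sum(ln.startswith(("import ", "from ")) for ln in lines)
    code = len(lines) - comments
    code_ratio = code / len(lines)                    # code vs. comment lines
    def_density = min(1.0, 5 * defs / max(code, 1))   # function/class density
    import_share = imports / len(lines)               # import-only chunks score lower
    score = 0.6 * code_ratio + 0.4 * def_density - 0.3 * import_share
    return max(0.0, min(1.0, score))


def filter_chunks(chunks: list[str], min_relevance: float = 0.3) -> list[str]:
    """Keep chunks whose score reaches the threshold (cf. --min-relevance)."""
    return [c for c in chunks if relevance_score(c) >= min_relevance]
```

The clamp at the end keeps the result within the 0.0-1.0 range that --min-relevance expects, whatever weights are chosen.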

Redundancy Removal

Automatically removes duplicate content across chunks while preserving unique context.
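One simple way to remove duplicates across chunks is to hash each block of text and drop blocks already seen in an earlier chunk. The sketch below assumes that "duplicate content" means identical blank-line-separated blocks after whitespace normalization; the function name `remove_redundancy` mirrors the option name but is not Komodo's actual code.

```python
import hashlib


def remove_redundancy(chunks: list[str]) -> list[str]:
    """Drop blank-line-separated blocks that already appeared in an earlier chunk."""
    seen: set[str] = set()
    result = []
    for chunk in chunks:
        kept = []
        for block in chunk.split("\n\n"):
            # Normalize whitespace so trivially reformatted copies still match.
            normalized = " ".join(block.split())
            if not normalized:
                continue
            digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            if digest in seen:
                continue
            seen.add(digest)
            kept.append(block)
        result.append("\n\n".join(kept))
    return result
```

The first occurrence of each block is always kept, so content unique to a chunk survives while repeated imports or license headers appear only once.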

Example with LLM optimizations:

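from pykomodo.enhanced_chunker import EnhancedParallelChunker

# LLM optimizations are provided by EnhancedParallelChunker (the --enhanced mode)
chunker = EnhancedParallelChunker(
    equal_chunks=5,
    extract_metadata=True,
    remove_redundancy=True,
    context_window=4096,
    min_relevance_score=0.3
)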

File Type Restriction

The file_type parameter of the ParallelChunker constructor lets you restrict which file extensions you process.

import os
from pykomodo.multi_dirs_chunker import ParallelChunker

output_dir = "/path/to/dir"
os.makedirs(output_dir, exist_ok=True)

chunker = ParallelChunker(
    max_chunk_size=1000,
    output_dir=output_dir,
    file_type="pdf" 
)

chunker.process_directory("/path/to/dir")

print("PDF processing completed successfully!")

Typed Classes & Pydantic-Based Configuration

Komodo’s main classes (ParallelChunker, EnhancedParallelChunker, etc.) now include type hints. Nothing changes at runtime, but if you use an IDE or a type checker like mypy, you should get better error checking and auto-completion.

You can also use Pydantic to configure Komodo with strongly typed settings. For instance:

from pydantic import BaseModel, Field
from typing import List, Optional
from pykomodo.multi_dirs_chunker import ParallelChunker
from pykomodo.enhanced_chunker import EnhancedParallelChunker

class KomodoConfig(BaseModel):
    directories: List[str] = Field(default_factory=lambda: ["."], description="Directories to process.")
    equal_chunks: Optional[int] = None
    max_chunk_size: Optional[int] = None
    output_dir: str = "chunks"
    semantic_chunking: bool = False
    enhanced: bool = False
    context_window: int = 4096
    min_relevance_score: float = 0.3
    remove_redundancy: bool = True
    extract_metadata: bool = True

def run_chunker_with_config(config: KomodoConfig):
    # Options shared by both chunker classes
    kwargs = dict(
        equal_chunks=config.equal_chunks,
        max_chunk_size=config.max_chunk_size,
        output_dir=config.output_dir,
        semantic_chunking=config.semantic_chunking,
    )

    if config.enhanced:
        # LLM-specific options are passed only to EnhancedParallelChunker
        ChunkerClass = EnhancedParallelChunker
        kwargs.update(
            context_window=config.context_window,
            min_relevance_score=config.min_relevance_score,
            remove_redundancy=config.remove_redundancy,
            extract_metadata=config.extract_metadata,
        )
    else:
        ChunkerClass = ParallelChunker

    chunker = ChunkerClass(**kwargs)
    chunker.process_directories(config.directories)
    chunker.close()

if __name__ == "__main__":
    # example use with typed + validated config
    cfg = KomodoConfig(directories=["src/", "docs/"], equal_chunks=5, enhanced=True)
    run_chunker_with_config(cfg)

Common Use Cases

1. Preparing Context for LLMs

Split a large codebase into equal chunks suitable for LLM context windows:

chunker = ParallelChunker(
    equal_chunks=5,
    priority_rules=[
        ("*.py", 10),    
        ("README*", 8), 
    ],
    user_ignore=["tests/**", "**/__pycache__/**"],
    output_dir="llm_chunks"
)
chunker.process_directory("my_project")
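The base chunker is described earlier as keeping files whole and distributing them into N chunks with approximately equal total character counts. One common way to do this is a greedy, largest-first assignment; whether Komodo uses this exact strategy is not stated, so treat the sketch below (with the hypothetical helper `distribute_files`) as one possible approach.

```python
import heapq


def distribute_files(sizes: dict[str, int], n_chunks: int) -> list[list[str]]:
    """Assign whole files to n_chunks bins, largest first, always filling the lightest bin."""
    bins: list[list[str]] = [[] for _ in range(n_chunks)]
    heap = [(0, i) for i in range(n_chunks)]  # (total characters, bin index)
    heapq.heapify(heap)
    # Sort by size descending; break ties by name for deterministic output.
    for name, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        total, i = heapq.heappop(heap)  # lightest bin so far
        bins[i].append(name)
        heapq.heappush(heap, (total + size, i))
    return bins
```

For example, four files of 900, 500, 400, and 100 characters split into two bins end up as totals of 1000 and 900, without any file being cut.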

Built-in Ignore Patterns

The chunker automatically ignores common non-text and build-related files:

  • **/.git/**
  • **/.idea/**
  • __pycache__
  • *.pyc
  • *.pyo
  • **/node_modules/**
  • target
  • venv

Common Gotchas

  1. Leading Slash for Absolute Paths
     • If you omit the leading / in a pattern like /Users/oha/..., Komodo treats it as relative and won’t match your actual absolute path.
  2. /** vs. /*
     • folder/** matches all files and subfolders under folder.
     • folder/* only matches the immediate contents of folder, not deeper subdirectories.
  3. Overwriting Multiple --ignore Flags
  4. Folder Name vs. Actual Path
     • If your path is really src/komodo/content/results, but you only wrote results/**, you may need a double-star approach (**/results/**) to cover deeper paths.
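The pattern behaviour described in the gotchas above can be reproduced with a small glob-to-regex translator. This is a sketch of the semantics as described here (`*` stays within one path segment, `**` crosses segments, patterns are anchored at the start of the path), not Komodo's actual matcher; `glob_to_regex` and `matches` are hypothetical helpers.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob where `*` stays within one segment and `**` crosses segments."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")  # zero or more leading directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")        # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")     # anything except a slash
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")      # one character except a slash
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")


def matches(path: str, pattern: str) -> bool:
    """Return True if the whole path matches the pattern."""
    return glob_to_regex(pattern).match(path) is not None
```

Under these semantics, results/** does not match src/komodo/content/results/out.csv because the path does not start with results/, while **/results/** does.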

Acknowledgments

This project was inspired by repomix, a repository content chunking tool.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

Apache 2.0
