update docs

seapagan · Jan 26, 2025 · 3b84a01
1 parent ab02023
commit 3b84a01
Showing 3 changed files with 80 additions and 56 deletions.
diff --git a/README-cratesio.md b/README-cratesio.md
@@ -9,33 +9,33 @@ consumption, code analysis, and repository review.
 XML was chosen for the file output format since it is very well structured and
 LLM models can easily parse it (better than a plain-text dump).
 
-It is inspired by [Repopack](#acknowledgements) which is a great tool, but is
+It is inspired by [Repomix](#acknowledgements) which is a great tool, but is
 written in TypeScript and needs a Node.js environment to run. Eventually this
 project will produce binaries and not need Rust installed to run.
 
-The generated XML metadata and structure are inspired by the output of Repopack
+The generated XML metadata and structure are inspired by the output of Repomix
 (a lot of the header text was taken from there), with enhancements that include
 additional file attributes, instructions for the LLM and a more robust
 structure. At this time `xml` output is the only supported output format,
 however future versions may include additional formats.
 
-> XML was chosen as the default output format since it is very well structured
-> and LLM models can easily parse it (better than a plain-text dump - see this
-> [link][why-xml] from Anthropic as to why XML is a superior format for feeding
-> context and instructions into an LLM).
+XML was chosen as the default output format since it is very well structured
+and LLM models can easily parse it (better than a plain-text dump - see this
+[link][why-xml] from Anthropic as to why XML is a superior format for feeding
+context and instructions into an LLM).
 
 ```pre
-BundleRepo Version 0.1.0, © 2024-2025 Grant Ramsay <[email protected]>
+BundleRepo Version 0.3.0, © 2024-2025 Grant Ramsay <[email protected]>
 
 Pack a local or remote Git Repository to XML for LLM Consumption.
 
--> Found a git repository in the current directory: '/home/seapagan/data/work/own/bundle-repo' (branch: main)
--> Successfully wrote XML to packed-repo.xml
+-> Found a git repository in the current directory: '/home/seapagan/data/work/own/bundle-repo' (branch: add-config-file)
+-> Successfully wrote XML to 'packed-repo.xml'
 
 Summary:
-     Total Files processed:  11
- Total output size (bytes):  47906
-      Token count (GPT-4o):  11344
+     Total Files processed:  13
+ Total output size (bytes):  79068
+      Token count (GPT-4o):  18766
 ```
 
 - [Compatibility](#compatibility)
@@ -109,29 +109,31 @@ build the project.
 
 ### Installation
 
-1. Clone the project and install dependencies.
+Clone the project and install dependencies.
 
-   - From [crates.io][crates-io-page]:
+- From [crates.io][crates-io-page]:
 
-     ```bash
-     cargo install bundle_repo
-     ```
+  ```bash
+  cargo install bundle_repo
+  ```
 
-   - From source:
+  The DeepSeek tokenizer file is embedded in the binary, so no additional setup is required.
 
-     ```bash
-     git clone https://github.com/seapagan/bundle-repo.git
-     cd bundle-repo
-     cargo build --release
-     ```
+- From source:
 
-     Move the binary to a directory in your `PATH`:
+  ```bash
+  git clone https://github.com/seapagan/bundle-repo.git
+  cd bundle-repo
+  cargo build --release
+  ```
 
-     eg for Linux or MacOS:
+  Move the binary to a directory in your `PATH`:
 
-     ```bash
-     sudo mv ./target/release/bundlerepo /usr/local/bin
-     ```
+  eg for Linux or MacOS:
+
+  ```bash
+  sudo mv ./target/release/bundlerepo /usr/local/bin
+  ```
 
 ### Running the Tool
 
@@ -311,6 +313,8 @@ Options:
   -t, --token <TOKEN>       GitHub personal access token (required for private repos and to pass rate limits)
   -e, --extend-exclude <PATTERN>  Additional file pattern to exclude (can be specified multiple times)
   -x, --exclude <PATTERN>   File pattern to exclude, replacing the default ignore list (can be specified multiple times)
+  -u, --utf8                Force UTF-8 encoding for all text files
+  -U, --no-utf8             Disable UTF-8 encoding for text files (overrides --utf8)
   -V, --version             Print version information and exit
   -h, --help                Print help
 ```
@@ -337,6 +341,8 @@ clipboard = false
 line_numbers = true
 token = "your-github-token"
 extend_exclude = ["*.md", "*.txt", "docs/*"]  # Additional patterns to exclude
+exclude = ["*.exe", "*.dll", "node_modules/*"]  # File patterns to exclude
+utf8 = true  # Force UTF-8 encoding for all text files
 ```
 
 All settings are optional. Settings are applied in the following order of
@@ -358,6 +364,7 @@ Available configuration options:
 - `extend_exclude`: Additional file patterns to exclude (default: none)
 - `exclude`: File patterns to exclude, replacing the default ignore list
   (default: none)
+- `utf8`: Whether to force UTF-8 encoding for all text files (default: false)
 
 The `extend_exclude` and `exclude` options can be specified either by using
 multiple `-e` or `-x` flags on the command line:
@@ -392,6 +399,11 @@ Storing your GitHub token in the configuration file can be more convenient than
 passing it via command line, especially if you frequently work with private
 repositories. Just be sure to keep your configuration file secure.
 
+The UTF-8 encoding feature (`--utf8` flag or `utf8 = true` in config) ensures all text files
+are encoded in UTF-8 before being included in the XML output. This is useful when working
+with files that may use different encodings, ensuring compatibility with LLMs and other tools.
+You can disable this with `--no-utf8` even if it's enabled in the config file.
+
 ## Ignored Files
 
 The tool will ignore the following files by default and (except for binary, see
@@ -406,7 +418,9 @@ below) they will not be listed anywhere in the XML output:
 - Python requirements files (`requirements.txt`, `requirements-dev.txt`, etc)
 - Lockfiles - any file ending in `.lock`
 - `renovate.json`
-- `license` files (e.g. `LICENSE`, `LICENSE.md`, etc)
+- `license` files (e.g. `LICENSE`, `LICENSE.md`, etc). Also matches the
+  alternate 'Licence' spelling.
+- `.vscode` folder and it's contents
 
 This list is hard-coded (and to be honest is tuned to my current workflow) and
 cannot be changed at this time. However, that will be changed once the
@@ -468,13 +482,13 @@ This tool is currently in **beta**. While the core functionality works, there
 may be edge cases or features yet to be fully refined. Feedback and
 contributions are welcome to improve and stabilize the tool.
 
-There is a pressing need for a test suite to ensure the tool works as expected
-in a variety of scenarios. This is a priority for the next release.
+There is a pressing need to improve the test suite to ensure the tool works as
+expected in a variety of scenarios. This is a priority for the next release.
 
 ## Acknowledgements
 
-**Bundle Repo** is a rewrite of the original
-[Repopack](https://github.com/yamadashy/repopack) project, though none of the
+**Bundle Repo** is a rewrite from scratch of the original [Repomix (formerly
+'repopack)](https://github.com/yamadashy/repomix) project, though none of the
 source code was used or even looked at (the output file header however was
 heavily borrowed from). The idea was to create a similar tool from scratch, with
 a few enhancements and improvements. It's also part of my journey to learn Rust

diff --git a/README.md b/README.md
@@ -9,11 +9,11 @@ consumption, code analysis, and repository review.
 XML was chosen for the file output format since it is very well structured and
 LLM models can easily parse it (better than a plain-text dump).
 
-It is inspired by [Repopack](#acknowledgements) which is a great tool, but is
+It is inspired by [Repomix](#acknowledgements) which is a great tool, but is
 written in TypeScript and needs a Node.js environment to run. Eventually this
 project will produce binaries and not need Rust installed to run.
 
-The generated XML metadata and structure are inspired by the output of Repopack
+The generated XML metadata and structure are inspired by the output of Repomix
 (a lot of the header text was taken from there), with enhancements that include
 additional file attributes, instructions for the LLM and a more robust
 structure. At this time `xml` output is the only supported output format,
@@ -318,17 +318,19 @@ Arguments:
   [REPO]  GitHub repository to clone (e.g. 'user/repo' or full GitHub URL). If not provided, the current directory will be searched for a Git repository.
 
 Options:
-  -b, --branch <BRANCH>     Specify a branch to checkout for remote repositories
-  -f, --file <OUTPUT_FILE>  Filename to save the bundle as. [default: packed-repo.xml]
-  -s, --stdout              Output the XML directly to stdout without creating a file.
-  -m, --model <MODEL>       Model to use for tokenization. Supported models: 'gpt4o', 'gpt4', 'gpt3.5', 'gpt3', 'gpt2', 'deepseek' [default: gpt4o]
-  -c, --clipboard           Copy the XML to the clipboard after creating it.
-  -l, --lnumbers           Add line numbers to each code file in the output.
-  -t, --token <TOKEN>       GitHub personal access token (required for private repos and to pass rate limits)
+  -b, --branch <BRANCH>           Specify a branch to checkout for remote repositories
+  -f, --file <OUTPUT_FILE>        Filename to save the bundle as. [default: packed-repo.xml]
+  -s, --stdout                    Output the XML directly to stdout without creating a file.
+  -m, --model <MODEL>             Model to use for tokenization. Supported models: 'gpt4o', 'gpt4', 'gpt3.5', 'gpt3', 'gpt2', 'deepseek' [default: gpt4o]
+  -c, --clipboard                 Copy the XML to the clipboard after creating it.
+  -l, --lnumbers                  Add line numbers to each code file in the output.
+  -t, --token <TOKEN>             GitHub personal access token (required for private repos and to pass rate limits)
   -e, --extend-exclude <PATTERN>  Additional file pattern to exclude (can be specified multiple times)
-  -x, --exclude <PATTERN>   File pattern to exclude, replacing the default ignore list (can be specified multiple times)
-  -V, --version            Print version information and exit
-  -h, --help               Print help
+  -x, --exclude <PATTERN>         File pattern to exclude, replacing the default ignore list (can be specified multiple times)
+  -u, --utf8                      Force UTF-8 encoding for all text files
+  -U, --no-utf8                   Disable UTF-8 encoding for text files (overrides --utf8)
+  -V, --version                   Print version information and exit
+  -h, --help                      Print help
 ```
 
 ## Configuration File
@@ -354,6 +356,7 @@ line_numbers = true
 token = "your-github-token"
 extend_exclude = ["*.md", "*.txt", "docs/*"]  # Additional patterns to exclude
 exclude = ["*.exe", "*.dll", "node_modules/*"]  # File patterns to exclude
+utf8 = true  # Force UTF-8 encoding for all text files
 ```
 
 All settings are optional. Settings are applied in the following order of
@@ -375,6 +378,7 @@ Available configuration options:
 - `extend_exclude`: Additional file patterns to exclude (default: none)
 - `exclude`: File patterns to exclude, replacing the default ignore list
   (default: none)
+- `utf8`: Whether to force UTF-8 encoding for all text files (default: false)
 
 The `extend_exclude` and `exclude` options can be specified either by using
 multiple `-e` or `-x` flags on the command line:
@@ -412,6 +416,13 @@ while the `exclude` patterns will **replace** the default ignore list entirely.
 > than passing it via command line, especially if you frequently work with
 > private repositories. Just be sure to keep your configuration file secure.
 
+> [!NOTE]
+>
+> The UTF-8 encoding feature (`--utf8` flag or `utf8 = true` in config) ensures all text files
+> are encoded in UTF-8 before being included in the XML output. This is useful when working
+> with files that may use different encodings, ensuring compatibility with LLMs and other tools.
+> You can disable this with `--no-utf8` even if it's enabled in the config file.
+
 ## Ignored Files
 
 The tool will ignore the following files by default and (except for binary, see
@@ -426,7 +437,8 @@ below) they will not be listed anywhere in the XML output:
 - Python requirements files (`requirements.txt`, `requirements-dev.txt`, etc)
 - Lockfiles - any file ending in `.lock`
 - `renovate.json`
-- `license` files (e.g. `LICENSE`, `LICENSE.md`, etc)
+- `license` files (e.g. `LICENSE`, `LICENSE.md`, etc). Also matches the
+  alternate 'Licence' spelling.
 - `.vscode` folder and it's contents
 
 This list is hard-coded (and to be honest is tuned to my current workflow) and
@@ -493,13 +505,13 @@ understood by an LLM. Below is an example layout with explanations for each tag:
 > may be edge cases or features yet to be fully refined. Feedback and
 > contributions are welcome to improve and stabilize the tool.
 >
-> There is a pressing need for a test suite to ensure the tool works as expected
-> in a variety of scenarios. This is a priority for the next release.
+> There is a pressing need to improve the test suite to ensure the tool works as
+> expected in a variety of scenarios. This is a priority for the next release.
 
 ## Acknowledgements
 
-**Bundle Repo** is a rewrite of the original
-[Repopack](https://github.com/yamadashy/repopack) project, though none of the
+**Bundle Repo** is a rewrite from scratch of the original [Repomix (formerly
+'repopack)](https://github.com/yamadashy/repomix) project, though none of the
 source code was used or even looked at (the output file header however was
 heavily borrowed from). The idea was to create a similar tool from scratch, with
 a few enhancements and improvements. It's also part of my journey to learn Rust

diff --git a/TODO.md b/TODO.md
@@ -1,7 +1,7 @@
 # Planned Improvements
 
 - add more output formats - Text, Markdown, maybe others.
-- add a test suite to ensure the tool works as expected in a variety of
+- improve the test suite to ensure the tool works as expected in a variety of
   scenarios.
 - allow individual files that are excluded by default to be included without
   wiping the default exclude set as `exclude` currently does.
@@ -19,13 +19,11 @@
   all 3 at this current code state, but we need to develop a test suite and get
   the CI pipeline working to ensure that it continues to work on all 3.
 - allow to work with non-git repositories (local only obviously).
-- add support for additional tokenizers (Claude, Gemini) when/if their specifications are publicly released
-- change file encoding to UTF-8 for included files, this is to ensure that the
-  XML file is valid and can be read by other tools and specifically LLM's who
-  generally prefer UTF-8.
+- add support for additional tokenizers (Claude, Gemini etc) when/if their
+  specifications are publicly released
 - allow user to add custom metadata to the XML file, this could be used to
   store information about the repository, such as the name, description, extra
-  instructions, etc. Would again be once the TOML file is implemented.
+  instructions, etc. Would use the TOML config file.
 - ignore `dotfiles` by default, but allow the user to include them if they want.
 - Add secret-checking to the tool, to ensure that no secrets are included in the
   output XML file. Hopefully this can be done with a library, but may need to