Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/add text splitter #256

Open
wants to merge 40 commits into
base: main
Choose a base branch
from

Conversation

JoaquinIglesiasTurina
Copy link
Contributor

I've ported CharacterTextSplitter from the Python lib. I've set up the code so RecursiveCharacterTextSplitter can be added easily.

It's still missing documentation, but the code is ready for review.

The differences with the Python implementation are:

  • Default Python separator is "\n\n" while this implementation's is " ". Due to this ecto issue, if " " is used as a separator, the separator becomes nil after changeset validation. Only by setting the default to " ", makes " " a feasible separator. This also makes porting the tests easy.
  • keep_separator option in Python is: bool, "start" or "end". I've implemented with options :start, :end or nil. I believe this makes things more explicit without loss of functionality. Python's True and "start" map to this :start. Python's False maps to this nil. Python's "end" maps to this :end.

All relevant tests from the Python lib have been ported. Up to here

I believe there is an issue when it comes to implement RecursiveCharacterTextSplitter, as it shares a lot of options with CharacterSplitter. I have not found an way to share the overlapping options in the embedded Ecto schema. This will lead to some code duplication. I do not believe this to be a significant issue, as it's not user facing. However, I am happy to hear suggestions.

@brainlid
Copy link
Owner

I think it's looking good. I especially appreciate the attention to matching the equivalent Python tests.

I added a stub for the module doc.

Yes, I think a recursive splitter is ultimately where it would need to go. But I like it! Thank you for putting in the effort to bring this feature.

Another point we would want in the documentation is a sample of "how would I actually use this?" If a Livebook is more appropriate for that, then it could make sense. But that's for the future.

@JoaquinIglesiasTurina
Copy link
Contributor Author

Hey. I am truly sorry, but I messed the pull request last night. I should have not done late night pull requesting. I had missed a whole file. I've fixed that and the docs.

I truly need a Python, so I will work I finish up some docs and usage examples. Then I'll work on recursive splitter.
But the CharacterSplitter is about half of the needed work for the RecursiveSplitter.

@JoaquinIglesiasTurina
Copy link
Contributor Author

I've written docs and provided some usage examples for the CharacterTextSplitter.

@brainlid
Copy link
Owner

@JoaquinIglesiasTurina Do you feel this is ready to merge in now and you will continue your work on the RecursiveSplitter? Or are you intending to add that work to this PR?

I want to make sure I understand if you're waiting for me or if I'm waiting for you. 🙂

@JoaquinIglesiasTurina
Copy link
Contributor Author

JoaquinIglesiasTurina commented Feb 21, 2025 via email

@brainlid
Copy link
Owner

If it's okay, I'd like to have the recursive splitter with this. As I understand it, the recursive splitter is really the thing most needed/used. So it feel like it's "complete" with that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants