-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/add text splitter #256
base: main
Are you sure you want to change the base?
Feat/add text splitter #256
Conversation
…heritance separate base from character text splitter
Added initial module doc stub
I think it's looking good. I especially appreciate the attention to matching the equivalent Python tests. I added a stub for the module doc. Yes, I think a recursive splitter is ultimately where it would need to go. But I like it! Thank you for putting in the effort to bring this feature. Another point we would want in the documentation is a sample of "how would I actually use this?" If a Livebook is more appropriate for that, then it could make sense. But that's for the future. |
Hey. I am truly sorry, but I messed the pull request last night. I should have not done late night pull requesting. I had missed a whole file. I've fixed that and the docs. I truly need a Python, so I will work I finish up some docs and usage examples. Then I'll work on recursive splitter. |
I've written docs and provided some usage examples for the |
@JoaquinIglesiasTurina Do you feel this is ready to merge in now and you will continue your work on the RecursiveSplitter? Or are you intending to add that work to this PR? I want to make sure I understand if you're waiting for me or if I'm waiting for you. 🙂 |
I've started working on recursive text splitter.
I believe this PR to be completed, as it provides a new piece of
functionality.
I would love to have it closed.
But I am happy to add recursive text splitter on this PR if you rather wait.
Thank you for the push.
…On Fri, Feb 21, 2025, 8:18 PM Mark Ericksen ***@***.***> wrote:
@JoaquinIglesiasTurina <https://github.com/JoaquinIglesiasTurina> Do you
feel this is ready to merge in now and you will continue your work on the
RecursiveSplitter? Or are you intending to add that work to this PR?
I want to make sure I understand if you're waiting for me or if I'm
waiting for you. 🙂
—
Reply to this email directly, view it on GitHub
<#256 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABZINVOGMYBPF2YNJTJRACT2Q53ZLAVCNFSM6AAAAABXDJLQTSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMNZVGM2TIMJSGU>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
[image: brainlid]*brainlid* left a comment (brainlid/langchain#256)
<#256 (comment)>
@JoaquinIglesiasTurina <https://github.com/JoaquinIglesiasTurina> Do you
feel this is ready to merge in now and you will continue your work on the
RecursiveSplitter? Or are you intending to add that work to this PR?
I want to make sure I understand if you're waiting for me or if I'm
waiting for you. 🙂
—
Reply to this email directly, view it on GitHub
<#256 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABZINVOGMYBPF2YNJTJRACT2Q53ZLAVCNFSM6AAAAABXDJLQTSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMNZVGM2TIMJSGU>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
If it's okay, I'd like to have the recursive splitter with this. As I understand it, the recursive splitter is really the thing most needed/used. So it feel like it's "complete" with that. |
I've ported CharacterTextSplitter from the Python lib. I've set up the code so RecursiveCharacterTextSplitter can be added easily.
It's still missing documentation, but the code is ready for review.
The differences with the Python implementation are:
"\n\n"
while this implementation's is" "
. Due to this ecto issue, if" "
is used as a separator, the separator becomesnil
after changeset validation. Only by setting the default to" "
, makes" "
a feasible separator. This also makes porting the tests easy.keep_separator
option in Python is:bool
,"start"
or"end"
. I've implemented with options:start
,:end
ornil
. I believe this makes things more explicit without loss of functionality. Python'sTrue
and"start"
map to this:start
. Python'sFalse
maps to thisnil
. Python's"end"
maps to this:end
.All relevant tests from the Python lib have been ported. Up to here
I believe there is an issue when it comes to implement
RecursiveCharacterTextSplitter
, as it shares a lot of options withCharacterSplitter
. I have not found an way to share the overlapping options in the embedded Ecto schema. This will lead to some code duplication. I do not believe this to be a significant issue, as it's not user facing. However, I am happy to hear suggestions.