🪲 Keyword translation fails for Catalan: "at random" -> "att random" #6118
Labels: bug (Something isn't working)
Comments
Oh no, the problem is probably here:

    def get_original_keyword(keyword_dict, keyword, line):
        for word in keyword_dict[keyword]:
            if word in line:
                return word

        # If we can't find the keyword, it means that it isn't part of the valid keywords for this language
        # so return original instead
        return keyword

It's not that
Replacing keywords on a string basis after we've already parsed is error-prone. Solution: replace based on the parse tree, but apply the substitutions in reverse order, so that a substitution never shifts the string indexes of the substitutions still to be applied.
rix0rrr added a commit that referenced this issue on Jan 22, 2025
In #6118, we discovered that when the source language keyword is a substring of the target language keyword, we make a mistake in the string replacement. This happens because we do the following:

- First: parse the entire program.
- Then: for every line containing a keyword, do a separate search and replace on the line.

In this PR, I'm changing the logic a little: as part of parsing, we already know the location in the line where the keyword occurs, so we can just immediately replace that part of the line with the new keyword.

Two complications make this slightly less straightforward than it sounds:

- Our grammar rules match whitespace as part of the keyword token, but the whitespace needs to remain in place.
  - Solution: find the whitespace in the matched token and only substitute the rest of the token.
- String substitutions may change the length of the string, and therefore invalidate all the indexes of the parse tree that follow it.
  - Solution: first collect a list of all substitutions, then apply them in back-to-front order. That way, the yet-to-be-processed indexes are never disturbed by a substitution.

Fixes #6118.
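The two solutions described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the `Substitution` type, the function names, and the sample line are all hypothetical, and the character offsets are assumed to come from the parser.

```python
from typing import List, NamedTuple


class Substitution(NamedTuple):
    """Hypothetical record of one keyword to replace (not Hedy's real type)."""
    start: int        # index of the first character of the matched token
    end: int          # index one past the last character of the matched token
    new_keyword: str  # keyword in the target language


def replace_keeping_whitespace(line: str, sub: Substitution) -> str:
    # The matched token may include surrounding whitespace, e.g. " a ".
    token = line[sub.start:sub.end]
    leading = len(token) - len(token.lstrip())
    kw_start = sub.start + leading
    kw_end = kw_start + len(token.strip())
    # Only the keyword itself is swapped; the whitespace stays in place.
    return line[:kw_start] + sub.new_keyword + line[kw_end:]


def apply_substitutions(line: str, subs: List[Substitution]) -> str:
    # Back-to-front: a replacement can only change text at or after its own
    # start, so the offsets of substitutions further left remain valid.
    for sub in sorted(subs, key=lambda s: s.start, reverse=True):
        line = replace_keeping_whitespace(line, sub)
    return line


# Placeholder lines; "x" stands in for whatever precedes the keywords.
catalan = apply_substitutions(
    "x a aleatori",
    [Substitution(1, 4, "at"), Substitution(4, 12, "random")],
)
english = apply_substitutions(
    "x at random",
    [Substitution(1, 5, "at"), Substitution(5, 11, "random")],
)
print(catalan)  # x at random
print(english)  # x at random
```

Because each replacement is driven by a position rather than by a text search, it no longer matters that "a" is a substring of "at".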
Describe the bug
If we translate the keywords in a program from Catalan to English, but the program already uses the English keywords "at random", the translation output is "att random".

I think this is because the Catalan translation of this keyword ("a") is a prefix of the English translation of this keyword ("at").

The following test reproduces it (put it in test_translation_level_06.py):

When I add a print statement to the innards of translate_keywords:

I see the following:

- "random" and replace it with "random".
- "a" and are replacing it with "at", whereas we actually should have matched the keyword "a".

This is not a pure language confusion bug, as the Catalan keyword for "random" is "aleatori", so we can definitely distinguish whether we matched an English or a Catalan keyword. This makes me think the problem is that "a" is a prefix of "at". Maybe it's a regex thing somewhere?