Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🪲 Keyword translation fails for Catalan: "at random" -> "att random" #6118

Closed
rix0rrr opened this issue Jan 21, 2025 · 2 comments · Fixed by #6121
Closed

🪲 Keyword translation fails for Catalan: "at random" -> "att random" #6118

rix0rrr opened this issue Jan 21, 2025 · 2 comments · Fixed by #6121
Assignees
Labels
bug Something isn't working

Comments

@rix0rrr
Copy link
Collaborator

rix0rrr commented Jan 21, 2025

Describe the bug

If we translate the keywords in a program from Catalan to English, but the English keywords at random are used, the translation output is att random.

I think this is because the Catalan translation of this keyword (a) is a prefix of the English translation of this keyword (at).

The following tests reproduces it (put it in test_translation_level_06.py):

    def test_specific_catalan_example(self):
        code = textwrap.dedent("""\
            ruleta1 = ruleta at random
            """)

        result = hedy_translation.translate_keywords(code, from_lang="ca", to_lang="en", level=self.level)

        self.assertEqual(result, 'ruleta1 = ruleta at random')

When I add a print statement to the the innards of translate_keywords:

                replaced_line = replace_token_in_line(line, rule, original, target)
                print(line, '|', replaced_line, '# replacing', rule, original, target)    # <-- debugging

I see the following:

ruleta1 = ruleta at random | ruleta1 = ruleta at random # replacing Rule(keyword='random', line=1, start=20, end=25, value='random') random random
ruleta1 = ruleta at random | ruleta1 = ruleta att random # replacing Rule(keyword='at', line=1, start=16, end=19, value=' at ') a at
  • For the first line: we find the keyword random and replace it with random.
  • For the second line, we see that we think we matched the keyword a and are replacing it with at, whereas we actually should have matched the keyword a.

This is not a pure language confusion bug, as the Catalan keyword for random is aleatori, so we definitely can distinguish whether we matched an English vs a Catalan keyword. This makes me the think the problem is that a is a prefix of at. Maybe it's a regex thing somewhere?

@rix0rrr rix0rrr added the bug Something isn't working label Jan 21, 2025
@rix0rrr
Copy link
Collaborator Author

rix0rrr commented Jan 21, 2025

Oh no, the problem is probably here:

def get_original_keyword(keyword_dict, keyword, line):
    for word in keyword_dict[keyword]:
        if word in line:
            return word

    # If we can't find the keyword, it means that it isn't part of the valid keywords for this language
    # so return original instead
    return keyword

It's not that a is a prefix of at, but it's a proper substring.

@rix0rrr rix0rrr self-assigned this Jan 21, 2025
@rix0rrr
Copy link
Collaborator Author

rix0rrr commented Jan 21, 2025

Keyword replacing on a string basis after we've already parsed is error prone.

Solution: replace based on parse tree, but do substitutions in reverse order so that later string indexes don't shift after a substitution.

rix0rrr added a commit that referenced this issue Jan 22, 2025
In #6118, we discovered that when the source language keywword is a
substring of the target language keyword, we make a mistake in the
string replacement.

This happens because we do the following:

- First: parse the entire program
- Then: for every line containing a keyword, do a separate search and
  replace on the line.

In this PR, I'm changing the logic a little: as part of parsing, we
already know the location in the line where the keyword occurs, so we
can just immediately replace that part of the line with the new keyword.

Two complications that make this slightly less straightforward than it
sounds:

- Our grammar rules match whitespace as part of the keyword token, but
  the whitespace needs to remain in place.
    - Solution: find the whitespace in the matched token and only
      substitute the rest of the token.
- String substitutions may change the length of the string, and
  therefore invalidate all the indexes of the parse tree that follow
  it.
    - Solution: first collect a list of all subsitutions, then apply
      them in back-to-front order. That way, all yet-to-be-processed
      indexes are never disturbed by a substitution.

Fixes #6118.
@mergify mergify bot closed this as completed in #6121 Jan 22, 2025
@mergify mergify bot closed this as completed in f8003e0 Jan 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: No status
Development

Successfully merging a pull request may close this issue.

1 participant