libXML errors in phpbb #108
Comments
Thanks for the report, I appreciate your diligence. Let me make sure I understand everything. Some of your posts contain invalid UTF-8 which, if I remember correctly, causes a fatal error during rendering. Your workaround is to re-encode the XML to make it valid. The mojibake remains, but at least it's valid UTF-8 and the page doesn't abruptly die. Is that correct? In that case, do you know what caused that text to contain invalid UTF-8? The library expects valid UTF-8 and does not validate the input, so given invalid input, it's going to produce invalid output. That much is expected, but I will look into potential workarounds such as rejecting invalid UTF-8 or replacing invalid characters, because if the program flow has to be interrupted I'd rather it be during parsing than during rendering. However, if the parser was given proper input and something managed to make it produce invalid UTF-8, that's something I will definitely want to fix.
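To make the two workarounds mentioned above concrete (rejecting invalid UTF-8 versus replacing invalid characters), here is a minimal Python sketch. It is only an illustration of the general idea, not the library's code; the function names `decode_strict` and `decode_lenient` are made up for this example.

```python
def decode_strict(raw: bytes) -> str:
    """Reject invalid UTF-8: raises UnicodeDecodeError on bad input."""
    return raw.decode("utf-8", errors="strict")


def decode_lenient(raw: bytes) -> str:
    """Replace each undecodable byte sequence with U+FFFD instead of failing."""
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    bad = b"caf\xe9"  # Latin-1 byte for 'e acute'; not valid UTF-8 on its own
    print(repr(decode_lenient(bad)))  # the bad byte becomes U+FFFD
    try:
        decode_strict(bad)
    except UnicodeDecodeError as exc:
        print("rejected at parse time:", exc.reason)
```

Rejecting interrupts the flow early, at input time; replacing keeps the page rendering at the cost of a visible replacement character.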
Yes, that's correct.
Yes. I run a phpbb Greek & Latin language board with posts going back to the early 2000s. Some Windows-1252 Greek text was apparently turned into mojibake in an earlier board upgrade. An example post that caused the libXML error: https://www.textkit.com/greek-latin-forum/viewtopic.php?p=48149 In the database it is:
That page is going to remain garbage until I can go through the database and fix all of the old Windows-1252 posts (if I can). However, for other threads, if it's only a word or two of offending text, a renderer error is more problematic.
For our use case, this would be ideal, as it replicates the pre-TextFormatter behavior. It's what I implemented by the utf8_encode() hack above.
And here is something closer to the original text: Python code I run to fix:
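For readers facing similar data, the following is a hypothetical Python sketch of one common kind of mojibake repair, not the script referred to above. It assumes the damage came from UTF-8 bytes being mis-decoded as Windows-1252; the posts on this board may well be damaged in a different way, and the helper name `undo_cp1252_mojibake` is invented for the example.

```python
def undo_cp1252_mojibake(text: str) -> str:
    """Try to reverse a UTF-8 -> Windows-1252 mis-decoding.

    Assumption: `text` was produced by decoding UTF-8 bytes as cp1252.
    If the round trip fails, the text is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside cp1252 (probably already
        # fine), or the recovered bytes are not valid UTF-8 (different damage).
        return text


if __name__ == "__main__":
    garbled = "\u00ce\u00b1\u00ce\u00b2"  # how two Greek letters look after the mis-decoding
    print(undo_cp1252_mojibake(garbled))
```

Some bytes have no Windows-1252 mapping in Python's codec, so text that passed through a more permissive decoder may not round-trip this way and would be left untouched by the sketch.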
Alright, I've thought about it and, while there's nothing I can really do at the library's level, I've added a check (21a52e5) to ensure the library rejects invalid input so it'll only generate valid output. Obviously it doesn't fix your issue, but it allows applications to detect encoding issues earlier in the process. I suspect your issue is related to your database's encoding, and to fix it properly you'll have to convert all of your tables to a UTF-8 character set. I'm closing this issue for now, but anyone is welcome to comment/reply with any pertinent information.
Using the TextFormatter (version 0.13.1) included with the latest phpbb, we noticed some hard errors on old pages containing mojibake, due to characters that are not allowed in XML. I did a quick hack to utf8_encode() the input where this was a problem. Applying utf8_encode() to everything broke some non-utf8 encoded text in turn, so I wrapped it in an if statement below.
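A rough Python analogue of that kind of guard, offered as a sketch and not as the PHP code actually used on the board: re-encode only when the bytes are not already valid UTF-8. PHP's utf8_encode() treats its input as ISO-8859-1, so the fallback here decodes as Latin-1; the function name `ensure_utf8` is made up for the example.

```python
def ensure_utf8(raw: bytes) -> bytes:
    """Return `raw` unchanged if it is valid UTF-8; otherwise reinterpret
    it as ISO-8859-1 (Latin-1) and re-encode it as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never fails.
        # The result is valid UTF-8 but may still be mojibake.
        return raw.decode("latin-1").encode("utf-8")
    return raw


if __name__ == "__main__":
    already_ok = "\u03b1\u03b2".encode("utf-8")
    legacy = b"caf\xe9"
    print(ensure_utf8(already_ok) == already_ok)  # valid input is left alone
    print(ensure_utf8(legacy).decode("utf-8"))
```

The check matters because unconditionally re-encoding text that is already UTF-8 would double-encode it.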
I noticed that the latest version of TextFormatter throws an exception, which may be more phpbb-friendly, whenever that project updates its version.
I'm guessing that the resolution of this issue will be "ask phpbb to update their TextFormatter version", but I wanted to file the issue here for completeness.