libXML errors in phpbb #108

Closed
jeidsath opened this issue Nov 23, 2018 · 4 comments

Comments

@jeidsath

Using the TextFormatter (version 0.13.1) included with the latest phpbb, we noticed some hard errors on old pages containing mojibake, caused by characters that are not valid in XML. As a quick hack, I ran utf8_encode() on the input where this was a problem. Running utf8_encode() on everything broke some non-utf8 encoded text in turn, so I wrapped it in the if statement below.

I noticed that the latest version of TextFormatter throws an exception, which may be more phpbb-friendly, whenever that project updates its version.

I'm guessing that the resolution of this issue will be "ask phpbb to update their TextFormatter version", but I wanted to file the issue here for completeness.

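            $this->checkUnsupported($xml);
            $flags = (\LIBXML_VERSION >= 20700) ? \LIBXML_COMPACT | \LIBXML_PARSEHUGE | \LIBXML_NOERROR : 0;
            $dom = new DOMDocument;
            $dom->loadXML($xml, $flags);
            // Workaround: documentElement stays null when loadXML() could not parse the input,
            // so retry once with the bytes re-encoded from ISO-8859-1 to UTF-8
            if (is_null($dom->documentElement))
            {
                    $dom->loadXML(utf8_encode($xml), $flags);
            }
            return $dom;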
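
For readers who want to try the same fallback outside PHP, here is a minimal Python sketch of the idea. It is not the TextFormatter code: it assumes Python's xml.etree.ElementTree as a stand-in for DOMDocument, and the names latin1_to_utf8 and load_xml_with_fallback are made up for illustration. latin1_to_utf8 mimics what utf8_encode() does, treating every byte as ISO-8859-1 and writing it back out as UTF-8.

```python
import xml.etree.ElementTree as ET


def latin1_to_utf8(raw: bytes) -> bytes:
    """Rough equivalent of PHP's utf8_encode(): read bytes as ISO-8859-1, emit UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


def load_xml_with_fallback(raw: bytes) -> ET.Element:
    """Parse raw as XML; re-encode and retry only if the first parse fails.

    Hypothetical helper mirroring the if statement in the PHP hack above.
    """
    try:
        return ET.fromstring(raw)
    except ET.ParseError:
        # Assumption: the failure was caused by invalid UTF-8 in the input.
        # The retry can still raise ParseError if the markup itself is broken.
        return ET.fromstring(latin1_to_utf8(raw))
```

Applying latin1_to_utf8 unconditionally would double-encode input that is already valid UTF-8, which is why the retry only happens after the first parse has failed.
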
@JoshyPHP
Member

Thanks for the report, I appreciate your diligence.

Let me make sure I understand everything. Some of your posts contain invalid UTF-8 which, if I remember correctly, causes a fatal error during rendering. Your workaround is to re-encode the XML to make it valid. The mojibake remains but at least it's valid UTF-8 and the page doesn't abruptly die. Is that correct?

In that case, do you know what caused that text to contain invalid UTF-8? The library expects valid UTF-8 and does not validate its input, so given invalid input it is going to produce invalid output. That much is expected, but I will look into potential workarounds such as rejecting invalid UTF-8 or replacing invalid characters: if the program flow has to be interrupted, I'd rather it be during parsing than during rendering.
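
The two workarounds mentioned here, rejecting invalid UTF-8 versus replacing invalid characters, can be sketched in Python as follows. This only illustrates the two strategies and is not what the library does; both function names are hypothetical.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Strategy 1 (reject): report whether raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def replace_invalid_utf8(raw: bytes) -> bytes:
    """Strategy 2 (replace): swap each undecodable sequence for U+FFFD."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")
```

With the first strategy the caller learns about the problem at parse time and can refuse the text. With the second the text always goes through, at the cost of visible replacement characters where the bad bytes were.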

However, if the parser was given proper input and something managed to make it produce invalid UTF-8, that's something I will definitely want to fix.

@jeidsath
Author

> The mojibake remains but at least it's valid UTF-8 and the page doesn't abruptly die. Is that correct?

Yes, that's correct.

> In that case, do you know what caused that text to contain invalid UTF-8?

Yes. I run a phpbb Greek & Latin language board with posts going back to the early 2000s. Some Windows-1252 Greek text was apparently turned into mojibake in an earlier board upgrade.

An example post that caused the libXML error: https://www.textkit.com/greek-latin-forum/viewtopic.php?p=48149

In the database it is:

<r>.. τί φατε πάντες πε�ὶ �μοῦ νέου ἀβάτατος;<br/> <br/> φίλετε �μὸν ἀβάτα� ἢ ο�κί;<br/> <br/> τόδ‘ �μὸν ἀβάτα� ἔστιν..<br/> <IMG src="http://i25.photobucket.com/albums/c79/g3string/kira_avatar_trans.gif"><s>[img]</s><URL url="http://i25.photobucket.com/albums/c79/g3string/kira_avatar_trans.gif"><LINK_TEXT text="http://i25.photobucket.com/albums/c79/g ... _trans.gif">http://i25.photobucket.com/albums/c79/g3string/kira_avatar_trans.gif</LINK_TEXT></URL><e>[/img]</e></IMG></r>

That page is going to remain garbage until I can go through the database and fix all of the old Windows-1252 posts (if I can). However, for other threads, if it's only a word or two of offending text, a renderer error is more problematic.
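
A minimal Python sketch of how such a database pass could separate posts that can be repaired losslessly from ones that cannot. It assumes the damage came from UTF-8 bytes being read as Windows-1252, which is an assumption and not something confirmed in this thread, and try_repair is a hypothetical helper name.

```python
from typing import Optional


def try_repair(text: str) -> Optional[str]:
    """Undo a UTF-8-read-as-Windows-1252 mix-up, or return None.

    None means the strict round trip failed: either the text is not this
    kind of mojibake (for example, it is already proper Greek), or bytes
    were lost along the way and the original cannot be recovered exactly.
    """
    try:
        # Strict mode on both steps, so nothing is silently replaced.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None
```

Rows where try_repair returns a string could be updated automatically, while rows that return None would need a manual look or a lossy fallback.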

> replacing invalid characters

For our use case, this would be ideal, as it replicates the pre-TextFormatter behavior. It's what I implemented with the utf8_encode() hack above.

@jeidsath
Author

And here is something closer to the original text:

Python code I run to fix: moji.encode('windows-1252', 'replace').decode('utf8', 'replace')

'<r>.. τί φατε πάντες πε�?ὶ �?μοῦ νέου ἀβάτατος;<br/>\n<br/>\nφίλετε �?μὸν ἀβάτα�? ἢ ο�?κί;<br/>\n<br/>\nτόδ‘ �?μὸν ἀβάτα�? ἔστιν..<br/>\n<IMG src="http://i25.photobucket.com/albums/c79/g3string/kira_avatar_trans.gif"><s>[img]</s><URL url="http://i25.photobucket.com/albums/c79/g3string/kira_avatar_trans.gif"><LINK_TEXT text="http://i25.photobucket.com/albums/c79/g ... _trans.gif">http://i25.photobucket.com/albums/c79/g3string/kira_avatar_trans.gif</LINK_TEXT></URL><e>[/img]</e></IMG></r> \n'
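
A short Python sketch of why the output above contains �? pairs. It assumes the mojibake came from UTF-8 bytes being decoded as Windows-1252 at some point, which is a guess about the old upgrade and not a confirmed cause. Byte 0x81 has no Windows-1252 mapping, so the second byte of ρ (0xCF 0x81) is lost in that step and cannot be brought back later. The sample word and the function names are illustrative.

```python
def simulate_lossy_upgrade(text: str) -> str:
    # Assumption: UTF-8 bytes were decoded as Windows-1252. Bytes with no
    # Windows-1252 mapping (0x81, 0x8D, 0x8F, 0x90, 0x9D) become U+FFFD.
    return text.encode("utf-8").decode("windows-1252", "replace")


def fix(moji: str) -> str:
    # The one-liner quoted above, unchanged.
    return moji.encode("windows-1252", "replace").decode("utf8", "replace")


original = "\u03c0\u03b5\u03c1\u1f76"  # the word περὶ from the sample post
moji = simulate_lossy_upgrade(original)
repaired = fix(moji)

# U+FFFD cannot be encoded as Windows-1252, so encode() writes "?" instead.
# The lone lead byte 0xCF before it then decodes to U+FFFD, giving "�?".
assert repaired == "\u03c0\u03b5\ufffd?\u1f76"
```

Words whose UTF-8 bytes all have a Windows-1252 mapping survive the round trip intact, which matches most of the sample coming back readable.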

@JoshyPHP
Member

Alright, I've thought about it, and while there's nothing I can really do at the library's level, I've added a check (21a52e5) to ensure the library rejects invalid input so that it only generates valid output. Obviously it doesn't fix your issue, but it allows applications to detect encoding issues earlier in the process.

I suspect your issue is related to your database's encoding, and to fix it properly you'll have to convert all of your tables to a UTF-8 character set such as utf8mb4. If you haven't already, you should ask around in the phpBB forums for the recommended procedure. Feel free to tag me in any related discussion, whether it's in forums or bug trackers.

I'm closing this issue for now but anyone is welcome to comment/reply with any pertinent information.


2 participants