Handle unicode in transformIterable better #88

mauritsvanrees · 2021-05-06T16:18:56Z

I think this improves the fix from #85

coveralls · 2021-05-06T16:56:19Z

Coverage increased (+0.2%) to 81.256% when pulling 3fa043c on maurits/transforms-iterable into 8b1d165 on master.

mauritsvanrees · 2021-05-06T19:14:49Z

A small explanation. Our transformIterable calls getHTMLSerializer (iterable) from repoze.iter, for which Fred already thought of making a fix.

I cannot reproduce an exception or error on my laptop.
But in a Python 3 prompt you can see some problems when the iterable could contain both bytes and unicode:

>>> from repoze.xmliter.utils import getHTMLSerializer
>>> print(getHTMLSerializer([b"<p>hello</p>", u"<p>bye</p>"]))
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>hello</p>p   &gt;   b   y   e   /   p   &gt;   </body></html>
>>> print(getHTMLSerializer([u"<p>hello</p>", b"<p>bye</p>"]))
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>hello</p>��</body></html>

So in the first case I get too many spaces, and in the second I get unknown characters.

On Plone 4.3 Python 2.7 with the same versions (plone.app.blocks 4.3.2 and repoze.xmliter 0.6) the first one gives the same problem, and the second one is slightly better but still off:

<html><body><p>hello</p>&#28732;&#25150;&#25977;&#12092;&#15984;</body></html>

For each item in the iterable, repoze.xmliter feeds it to an lxml function feed which has as documentation:

The argument should be an 8-bit string buffer containing encoded data, although Unicode is supported as long as both string types are not mixed.

So it seemed best to me to make sure in plone.app.blocks that we only pass bytes.

Reminder:

On Python 2 both b"" and "" are of instance bytes, which is the same as str; and u"" is unicode
On Python 3 b"" is of instance bytes, and both "" and u"" are of instance str, which is the old unicode.

For clarity I only use bytes and unicode above and in the code

mauritsvanrees · 2021-05-07T09:48:12Z

One more thing: I saw the traceback with Fred today. The exception that is raised there is lxml.etree.ParserError. The plone.app.blocks code has a try/except for ParseError without an "r".

Both exist. From the help:

Help on class ParseError in module lxml.etree:

class ParseError(LxmlSyntaxError)
 |  ParseError(message, code, line, column, filename=None)
 |  
 |  Syntax error while parsing an XML document.
 |  
 |  For compatibility with ElementTree 1.3 and later.
 |  
 |  Method resolution order:
 |      ParseError
 |      LxmlSyntaxError
 |      LxmlError
 |      Error
 |      builtins.SyntaxError
 |      builtins.Exception
 |      builtins.BaseException
 |      builtins.object

...

Help on class ParserError in module lxml.etree:

class ParserError(LxmlError)
 |  Internal lxml parser error.
 |  
 |  Method resolution order:
 |      ParserError
 |      LxmlError
 |      Error
 |      builtins.Exception
 |      builtins.BaseException
 |      builtins.object

So we could try/except both, or maybe use lxmlError but that might catch some too many errors.
Anyway, no change really needed for now. We briefly adapted the exception handling, but that did not fix the empty Mosaic page.

jensens · 2021-05-07T13:54:54Z

Looking at this issue to me it looks like we should start dropping Python 2 support.

ksuess

Thanks a lot for your effort, @mauritsvanrees . It's for sure best to use safe_bytes from Products.CMFPlone.utils, as you do here. And adding tests is great.

But, without diving into mosaic (what I do not want, as my focus is on Plone 6) I cannot asses what plone.app.blocks duty is and why here is the need to deal with bytes on this abstraction layer.

And prepared this file for having another TestCase class.

Especially we pass both bytes and unicode to the transformIterable method.

Especially fix parsing combinations of bytes and unicode. I think this improves the fix from #85 which fixed plone/plone.app.mosaic#480

thet

LGTM

mauritsvanrees · 2021-06-11T15:52:25Z

I have released 5.0.0 with this and the other recent fixes.

mauritsvanrees requested review from agitator, ksuess and fredvd May 6, 2021 16:19

mauritsvanrees mentioned this pull request May 6, 2021

Tests: refactor buildout configs, use tox, use GitHub actions #89

Merged

ksuess reviewed May 13, 2021

View reviewed changes

thet force-pushed the maurits/transforms-iterable branch from 3fa043c to 338391d Compare May 15, 2021 14:07

mauritsvanrees and others added 5 commits May 16, 2021 23:05

Refactored test_transforms to not have so much duplicate code.

a7c7c12

And prepared this file for having another TestCase class.

Added tests for the ParseXML class.

ac43a83

Especially we pass both bytes and unicode to the transformIterable method.

Fixed parsing of unicodes in ParseXML.transformIterable.

3c09661

Especially fix parsing combinations of bytes and unicode. I think this improves the fix from #85 which fixed plone/plone.app.mosaic#480

Define safe_bytes on Plone 4.3 and 5.0.

62d265c

Remove 4.3/5.0 compatibility code.

b6d432b

thet force-pushed the maurits/transforms-iterable branch from 338391d to b6d432b Compare May 16, 2021 21:11

Add myself to the changelog, mostly to please mr.roboto.

4bc1e9e

thet approved these changes May 17, 2021

View reviewed changes

thet merged commit bfc89d1 into master May 17, 2021

thet deleted the maurits/transforms-iterable branch May 17, 2021 08:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle unicode in transformIterable better #88

Handle unicode in transformIterable better #88

mauritsvanrees commented May 6, 2021

coveralls commented May 6, 2021 •

edited

Loading

mauritsvanrees commented May 6, 2021

mauritsvanrees commented May 7, 2021

jensens commented May 7, 2021

ksuess left a comment

thet left a comment

mauritsvanrees commented Jun 11, 2021

Handle unicode in transformIterable better #88

Handle unicode in transformIterable better #88

Conversation

mauritsvanrees commented May 6, 2021

coveralls commented May 6, 2021 • edited Loading

mauritsvanrees commented May 6, 2021

mauritsvanrees commented May 7, 2021

jensens commented May 7, 2021

ksuess left a comment

Choose a reason for hiding this comment

thet left a comment

Choose a reason for hiding this comment

mauritsvanrees commented Jun 11, 2021

coveralls commented May 6, 2021 •

edited

Loading