Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle unicode in transformIterable better #88

Merged
merged 6 commits into from
May 17, 2021
Merged

Conversation

mauritsvanrees
Copy link
Member

I think this improves the fix from #85

@coveralls
Copy link

coveralls commented May 6, 2021

Coverage Status

Coverage increased (+0.2%) to 81.256% when pulling 3fa043c on maurits/transforms-iterable into 8b1d165 on master.

@mauritsvanrees
Copy link
Member Author

A small explanation. Our transformIterable calls getHTMLSerializer (iterable) from repoze.iter, for which Fred already thought of making a fix.

I cannot reproduce an exception or error on my laptop.
But in a Python 3 prompt you can see some problems when the iterable could contain both bytes and unicode:

>>> from repoze.xmliter.utils import getHTMLSerializer
>>> print(getHTMLSerializer([b"<p>hello</p>", u"<p>bye</p>"]))
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>hello</p>p   &gt;   b   y   e   /   p   &gt;   </body></html>
>>> print(getHTMLSerializer([u"<p>hello</p>", b"<p>bye</p>"]))
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>hello</p>��</body></html>

So in the first case I get too many spaces, and in the second I get unknown characters.

On Plone 4.3 Python 2.7 with the same versions (plone.app.blocks 4.3.2 and repoze.xmliter 0.6) the first one gives the same problem, and the second one is slightly better but still off:

<html><body><p>hello</p>&#28732;&#25150;&#25977;&#12092;&#15984;</body></html>

For each item in the iterable, repoze.xmliter feeds it to an lxml function feed which has as documentation:

The argument should be an 8-bit string buffer containing encoded data, although Unicode is supported as long as both string types are not mixed.

So it seemed best to me to make sure in plone.app.blocks that we only pass bytes.

Reminder:

  • On Python 2 both b"" and "" are of instance bytes, which is the same as str; and u"" is unicode
  • On Python 3 b"" is of instance bytes, and both "" and u"" are of instance str, which is the old unicode.

For clarity I only use bytes and unicode above and in the code

@mauritsvanrees
Copy link
Member Author

One more thing: I saw the traceback with Fred today. The exception that is raised there is lxml.etree.ParserError. The plone.app.blocks code has a try/except for ParseError without an "r".

Both exist. From the help:

Help on class ParseError in module lxml.etree:

class ParseError(LxmlSyntaxError)
 |  ParseError(message, code, line, column, filename=None)
 |  
 |  Syntax error while parsing an XML document.
 |  
 |  For compatibility with ElementTree 1.3 and later.
 |  
 |  Method resolution order:
 |      ParseError
 |      LxmlSyntaxError
 |      LxmlError
 |      Error
 |      builtins.SyntaxError
 |      builtins.Exception
 |      builtins.BaseException
 |      builtins.object

...

Help on class ParserError in module lxml.etree:

class ParserError(LxmlError)
 |  Internal lxml parser error.
 |  
 |  Method resolution order:
 |      ParserError
 |      LxmlError
 |      Error
 |      builtins.Exception
 |      builtins.BaseException
 |      builtins.object

So we could try/except both, or maybe use lxmlError but that might catch some too many errors.
Anyway, no change really needed for now. We briefly adapted the exception handling, but that did not fix the empty Mosaic page.

@jensens
Copy link
Member

jensens commented May 7, 2021

Looking at this issue to me it looks like we should start dropping Python 2 support.

Copy link
Member

@ksuess ksuess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your effort, @mauritsvanrees . It's for sure best to use safe_bytes from Products.CMFPlone.utils, as you do here. And adding tests is great.

But, without diving into mosaic (what I do not want, as my focus is on Plone 6) I cannot asses what plone.app.blocks duty is and why here is the need to deal with bytes on this abstraction layer.

@thet thet force-pushed the maurits/transforms-iterable branch from 3fa043c to 338391d Compare May 15, 2021 14:07
mauritsvanrees and others added 5 commits May 16, 2021 23:05
And prepared this file for having another TestCase class.
Especially we pass both bytes and unicode to the transformIterable method.
Especially fix parsing combinations of bytes and unicode.
I think this improves the fix from #85
which fixed plone/plone.app.mosaic#480
@thet thet force-pushed the maurits/transforms-iterable branch from 338391d to b6d432b Compare May 16, 2021 21:11
Copy link
Member

@thet thet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thet thet merged commit bfc89d1 into master May 17, 2021
@thet thet deleted the maurits/transforms-iterable branch May 17, 2021 08:35
@mauritsvanrees
Copy link
Member Author

I have released 5.0.0 with this and the other recent fixes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants