Is beautiful soup entity encoding your inserted html? #33

Open
sligodave opened this issue Mar 20, 2012 · 4 comments

Comments

@sligodave

Hi,
I could be wrong here, but just in case, I thought I'd bring this to your attention.

At the end of the parse_data method of the HTMLParser, you call "replaceWith" on the matched url.
It appears that since the step from BeautifulSoup 3.2.0 to BeautifulSoup 3.2.1
the inserted html is entity encoded, which breaks the rendered output.

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("")
soup.insert(0, "<b>YAY</b>")
print unicode(soup)

The above under BS 3.2.0 printed:

<b>YAY</b>

Under BS 3.2.1 it prints:
&lt;b&gt;YAY&lt;/b&gt;

I haven't had the time to dig an awful lot but the solution might be to create a BS representation of the replacement html and pass that to replaceWith.

Thanks,
Dave

@coleifer
Contributor

Yep you're totally right, I thought I had opened an issue to that effect here but I guess I had not. I actually opened up a bug on their launchpad and need to respond with some info. I will use the example you provided, thanks for that. The maintainer has some suggestions and you can follow up here:
https://bugs.launchpad.net/beautifulsoup/+bug/949074

@coleifer
Contributor

You may be interested in my replacement project for djangoembed, http://micawber.readthedocs.org/ -- the html parser does not have this issue.

@azreda

azreda commented Jul 4, 2012

This issue remains, breaking the HTML parsing method. Downgrading to 3.2.0 is a temporary solution.

The proper solution is described on the bug report:
"If you put a string into the soup its XML characters should always be escaped. Since you want "<b>YAY</b>" to be treated as an HTML tag, you can create a Tag object instead"
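
The string-versus-node distinction in that quote is not specific to BeautifulSoup. As a rough illustration only, here is a minimal sketch using Python's standard-library xml.etree.ElementTree as a stand-in (an assumption made for this example; it is not the library discussed in this thread and not the fix applied in either project). The names REPLACEMENT, as_text and as_node are made up for the example.

```python
import xml.etree.ElementTree as ET

REPLACEMENT = "<b>YAY</b>"  # the markup from the original report

# Case 1: the markup is inserted as a plain string.
# The serializer treats it as text, so < and > are escaped.
as_text = ET.Element("div")
as_text.text = REPLACEMENT
escaped = ET.tostring(as_text, encoding="unicode")

# Case 2: the markup is parsed into a node first, then inserted.
# The serializer now emits it as a real element.
as_node = ET.Element("div")
as_node.append(ET.fromstring(REPLACEMENT))
preserved = ET.tostring(as_node, encoding="unicode")

print(escaped)    # <div>&lt;b&gt;YAY&lt;/b&gt;</div>
print(preserved)  # <div><b>YAY</b></div>
```

The second case mirrors the suggestion in the original report: build a parsed representation of the replacement html and pass that, rather than the raw string, to replaceWith.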

@coleifer
Contributor

coleifer commented Jul 5, 2012

You can also see on lines 130/131 of micawber, I have fixed this:
https://github.com/coleifer/micawber/blob/master/micawber/parsers.py#L130

Please note - I am not working on this project anymore. I've written a replacement:

https://github.com/coleifer/micawber
