Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allowed Schemas option broken by bad html entities in input #693

Open
kiprobinsonknack opened this issue Feb 10, 2025 · 4 comments
Open
Labels

Comments

@kiprobinsonknack
Copy link

kiprobinsonknack commented Feb 10, 2025

To Reproduce

See also example script at bottom

  1. Use options to only allow a tags, only allowing href attribute, and only allowing https schema
  2. Sanitize a string where the href's protocol attribute contains html entities in decimal or hex format, but lacking the trailing semicolon
    • Example of html entities: &#112 (decimal), &#x70 (hex)
    • Example of full string to sanitize-html: <a href=htt&#112://example.com>ClickMe</a>

Expected behavior

Expected: The href attribute is stripped from the string. (<a>ClickMe</a>)

Actual: The href remains, with the ampersand in the attribute changed to &amp;. This results in a tag like: <a href="htt&amp;#112://example.com">ClickMe</a>

Describe the bug

If the html entities are correctly formatted with trailing semicolon, we get the expected output. But the whole point of sanitization is to handle bad input. 😄

In any case, this is clearly not https schema, so the attribute should be stripped.

Security Consideration

I don't think this constitutes a security vulnerability. When the browser sees <a href="htt&amp;#112://example.com">ClickMe</a>, it treats that like <a href="file:///htt&amp;#112://example.com">ClickMe</a>, which is to say, a link to /htt& with everything after the # treated as the page hash. Maybe there's some vulnerability if you know the user happens to have a file in a specific spot on their local system with a name ending in ampersand?

Details

Version of Node.js: v22.13.1

Server Operating System: MacOS Version 15.3 (24D60) (also reproducible on ubuntu, but i'm not sure the version)

Additional context:

n/a

Screenshots

n/a

Example Script

const sanitizeHtml = require('sanitize-html');

// Only allow https links
const options = {
  allowedTags: [
    'a',
  ],
  allowedAttributes: {
    a: [
      'href',
    ],
  },
  allowedSchemes: [
    'https',
  ],
};


// Example 1
// Input:    <a href=htt&#112://example.com>ClickMe</a>
// Expected: <a>ClickMe</a>
// Actual:   <a href="htt&amp;#112://example.com">ClickMe</a>
// Notes:
//     * The HTML entity `&#112` lacks the trailing semicolon
//     * The browser treats the actual result like "file:///htt&amp;#112://example.com">ClickMe</a>"
console.log(sanitizeHtml('<a href=htt&#112://example.com>ClickMe</a>', options));

// Example 2
// Input:    <a href=htt&#112;://example.com>ClickMe</a>
// Expected: <a>ClickMe</a>
// Actual:   <a>ClickMe</a>
// Notes:
//     * Unlike Example 1, this one includes the trailing semicolon in the html entity
console.log(sanitizeHtml('<a href=htt&#112;://example.com>ClickMe</a>', options));

// Example 3
// Input:    <a href=htt&#x70://example.com>ClickMe</a>
// Expected: <a>ClickMe</a>
// Actual:   <a href="htt&amp;#x70://example.com">ClickMe</a>
// Notes:
//     * This is the same as Example 1, except it uses hex encoding rather than decimal
console.log(sanitizeHtml('<a href=htt&#x70://example.com>ClickMe</a>', options));


// Example 4
// Input:    <a href=j&#97v&#97script:&#97lert(1)>ClickMe</a>
// Expected: <a>ClickMe</a>
// Actual:   <a href="j&amp;#97v&amp;#97script:&amp;#97lert(1)">ClickMe</a>
// Notes:
//     * This is _NOT_ an XSS vulnerability: the browser treats this link like `file:///j&amp;#97v&amp;#97script:&amp;#97lert(1)`
console.log(sanitizeHtml('<a href=j&#97v&#97script:&#97lert(1)>ClickMe</a>', options));
@boutell
Copy link
Member

boutell commented Feb 10, 2025

this looks like a weird bypass of allowedSchemas, even if the end result is not exploitable, so I agree we need to look at it.

@kiprobinsonknack
Copy link
Author

This might not have anything to do with bad entities actually:

console.log(sanitizeHtml('<a href=htt//example.com>ClickMe</a>', options));
// OUTPUT:  <a href="htt//example.com">ClickMe</a>

@kiprobinsonknack
Copy link
Author

kiprobinsonknack commented Feb 11, 2025

Actually, I'm not sure if this is really a bug after all 😅

When I said this:

When the browser sees <a href="htt&amp;#112://example.com">ClickMe</a>, it treats that like <a href="file:///htt&amp;#112://example.com">ClickMe</a>

I was mistaken; the browser treats it like a relative link to a file named htt&, not a file:// link. In other words, if i'm viewing the HTML file at https://example.com/some/path, and it sees <a href="htt&amp;#112://example.com">ClickMe</a>, that is a link to https://example.com/some/path/htt&. The same way that <a href="whatever.html#somehash">ClickMe</a> is treated like a link to https://example.com/some/path/whatever.html.

Similarly, <a href="htt//example.com">ClickMe</a> is treated like a link to https://example.com/some/path/htt//example.com.

I think there might still be something worth fixing regarding broken html entities. te&#115t is displayed by the browser (at least in Chrome) as test, but sanitize-html turns it into te&amp;#115t which is displayed by the browser as te&#115t.

Whether that is desired or not, I could kind of argue both ways...

@boutell
Copy link
Member

boutell commented Feb 11, 2025

Interesting. You seem to be saying the browser completely ignores the : and everything after it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants