
html5gum

docs.rs crates.io

html5gum is a WHATWG-compliant HTML tokenizer.

use std::fmt::Write;
use html5gum::{Tokenizer, Token};

let html = "<title   >hello world</title>";
let mut new_html = String::new();

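// Re-serialize the token stream; the extra whitespace inside "<title   >"
// disappears because only the tag name is written back out.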
for token in Tokenizer::new(html).infallible() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", hello_world).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");

It fully implements section 13.2 of the WHATWG HTML spec and passes html5lib's tokenizer test suite, except that:

  • this implementation requires all input to be Rust strings and therefore valid UTF-8. There is no charset detection and no handling of invalid surrogates, and the relevant html5lib tests are skipped in CI. Raw bytes have to be decoded before tokenization; see the sketch after this list.

  • there are some remaining testcases still to be decided on; see issue 5.
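
A minimal sketch of feeding raw bytes through the tokenizer, using only the API from the example above plus the standard library's lossy UTF-8 conversion; the input bytes and variable names here are purely illustrative:

use std::fmt::Write;
use html5gum::{Tokenizer, Token};

// Hypothetical raw bytes, e.g. read from the network; html5gum does no
// charset detection, so decode them to UTF-8 up front (lossily, in this sketch).
let raw: &[u8] = b"<p>hello</p>";
let html = String::from_utf8_lossy(raw);

let mut text = String::new();
for token in Tokenizer::new(&*html).infallible() {
    // Keep only character data; tags and everything else are ignored.
    if let Token::String(s) = token {
        write!(text, "{}", s).unwrap();
    }
}

assert_eq!(text, "hello");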

A distinguishing feature of html5gum is that you can bring your own token data structure and hook into token creation by implementing the Emitter trait. This allows you to:

  • Rewrite all per-HTML-tag allocations to use a custom allocator or data structure.

  • Efficiently filter out uninteresting categories of data without ever allocating for them. For example, if the plaintext between tags is not of interest to you, you can implement the respective trait methods as no-ops and avoid any overhead from creating plaintext tokens (see the sketch after this list).
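
The Emitter trait's exact method set varies between versions, so no full no-op implementation is sketched here. For contrast, the snippet below shows the simpler (but less efficient) route of keeping the default emitter and discarding tokens after the fact, using only the API from the example above; every discarded token is still allocated, which is exactly the overhead a custom Emitter can avoid.

use std::fmt::Write;
use html5gum::{Tokenizer, Token};

let html = "<ul><li>one</li><li>two</li></ul>";
let mut names = String::new();

for token in Tokenizer::new(html).infallible() {
    match token {
        // Keep only start-tag names.
        Token::StartTag(tag) => write!(names, "{} ", tag.name).unwrap(),
        // Text, end tags and anything else are dropped here, but the default
        // emitter has already built them; a custom Emitter with no-op methods
        // would skip creating them in the first place.
        _ => {}
    }
}

assert_eq!(names, "ul li li ");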

html5gum was created out of a need to parse HTML tag soup efficiently. Previous options were to:

  • use quick-xml or xmlparser with some hacks to make either one not choke on bad HTML. For a (rather large) subset of HTML input this works well (quick-xml in particular can be configured to be very lenient about parsing errors) and parsing speed is stellar, but neither can parse all HTML.

    For my own use case, html5gum is about 2x slower than quick-xml.

  • use html5ever's own tokenizer to avoid as much tree-building overhead as possible. This was functional but had poor performance for my own use case (10-15x slower than quick-xml).

  • use lol-html, which would probably perform at least as well as html5gum, but comes with a closure-based API that I didn't manage to get working for my use case.

Etymology

Why is this library called html5gum?

  • G.U.M: Giant Unreadable Match-statement

  • <insert "how it feels to chew 5 gum parse HTML" meme here>

License

Licensed under the MIT license, see ./LICENSE.
