Tokenizer tests

The test format is JSON. This has the advantage that the syntax allows backward-compatible extensions to the tests and the disadvantage that it is relatively verbose.

Basic Structure

{"tests": [
    {"description": "Test description",
    "input": "input_string",
    "output": [expected_output_tokens],
    "initialStates": [initial_states],
    "lastStartTag": last_start_tag,
    "errors": [parse_errors]
    }
]}

Multiple tests per file are allowed simply by adding more objects to the "tests" list.

Each parse error is an object that contains an error code and one-based error location indices: line and col.
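
For example, an error object might look like this (an illustrative sample; "eof-in-tag" is one of the spec's parse error codes):

{"code": "eof-in-tag", "line": 1, "col": 5}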

description, input and output are always present. The other values are optional.

Test set-up

test.input is a string containing the characters to pass to the tokenizer. Specifically, it represents the characters of the input stream, and so implementations are expected to perform the processing described in the spec's Preprocessing the input stream section before feeding the result to the tokenizer.
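
A minimal sketch of that preprocessing in Python (only the newline normalization is shown; the spec section also defines parse errors for certain code points, which are omitted here):

def preprocess(chars):
    # From "Preprocessing the input stream": normalize newlines
    # by replacing CRLF pairs and lone CRs with LF.
    return chars.replace("\r\n", "\n").replace("\r", "\n")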

If test.doubleEscaped is present and true, then test.input is not quite as described above. Instead, it must first be subjected to another round of unescaping (i.e., in addition to any unescaping involved in the JSON import), and the result of that represents the characters of the input stream. Currently, the only unescaping required by this option is to convert each sequence of the form \uHHHH (where H is a hex digit) into the corresponding Unicode code point. (Note that this option also affects the interpretation of test.output.)
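
The extra unescaping round can be implemented as below (a sketch; the name double_unescape is illustrative, not part of the suite):

import re

def double_unescape(s):
    # After JSON decoding, the string still contains literal
    # "\uHHHH" sequences; convert each one into its code point.
    # This may produce lone surrogates, which is the point of the
    # option: such code points are hard to carry in JSON directly.
    return re.sub(
        r"\\u([0-9A-Fa-f]{4})",
        lambda m: chr(int(m.group(1), 16)),
        s,
    )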

test.initialStates is a list of strings, each being the name of a tokenizer state, which must be one of the following:

  • Data state
  • PLAINTEXT state
  • RCDATA state
  • RAWTEXT state
  • Script data state
  • CDATA section state

The test should be run once for each string, using it to set the tokenizer's initial state for that run. If test.initialStates is omitted, it defaults to ["Data state"].
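
Combining these rules, a test runner might look roughly like the following sketch (tokenize is an assumed adapter around your tokenizer, and handling of doubleEscaped and input preprocessing is omitted for brevity):

import json

def run_tests(path, tokenize):
    # tokenize(chars, initial_state, last_start_tag) -> token list
    with open(path, encoding="utf-8") as f:
        tests = json.load(f)["tests"]
    for test in tests:
        # Run once per initial state, defaulting to the Data state.
        for state in test.get("initialStates", ["Data state"]):
            got = tokenize(test["input"], state,
                           test.get("lastStartTag"))
            assert got == test["output"], test["description"]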

test.lastStartTag is a lowercase string that should be used as "the tag name of the last start tag to have been emitted from this tokenizer", referenced in the spec's definition of appropriate end tag token. If it is omitted, it is treated as if "no start tag has been emitted from this tokenizer".

Test results

test.output is a list of tokens, in the order the tokenizer produced them (the first token produced is the leftmost in the list). The list must match the complete list of tokens that the tokenizer should produce. Valid tokens are:

["DOCTYPE", name, public_id, system_id, correctness]
["StartTag", name, {attributes}*, true*]
["StartTag", name, {attributes}]
["EndTag", name]
["Comment", data]
["Character", data]

public_id and system_id are either strings or null. correctness is either true or false; true corresponds to the force-quirks flag being false, and vice-versa.

When the self-closing flag is set, the StartTag array has true as its fourth entry. When the flag is not set, the array has only three entries for backwards compatibility.
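
For example, <br/> would be reported as ["StartTag", "br", {}, true], while <br> would be ["StartTag", "br", {}].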

All adjacent character tokens are coalesced into a single ["Character", data] token.
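
As an illustration (a made-up test, not taken from the suite), a simple document could be expressed as:

{"description": "Simple document",
 "input": "<!DOCTYPE html><p id=a>x</p>",
 "output": [["DOCTYPE", "html", null, null, true],
            ["StartTag", "p", {"id": "a"}],
            ["Character", "x"],
            ["EndTag", "p"]]}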

If test.doubleEscaped is present and true, then every string within test.output must be further unescaped (as described above) before comparing with the tokenizer's output.

xmlViolation tests

tokenizer/xmlViolation.test differs from the above in a couple of ways:

  • The name of the single member of the top-level JSON object is "xmlViolationTests" instead of "tests".
  • Each test's expected output assumes that the implementation is applying the tweaks given in the spec's "Coercing an HTML DOM into an infoset" section.
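
So the top level of that file has the following shape, with each test following the same per-test structure described above:

{"xmlViolationTests": [
    ...
]}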