The test format is JSON. This has the advantage that the syntax allows backward-compatible extensions to the tests and the disadvantage that it is relatively verbose.
{"tests": [
{"description": "Test description",
"input": "input_string",
"output": [expected_output_tokens],
"initialStates": [initial_states],
"lastStartTag": last_start_tag,
"errors": [parse_errors]
}
]}
Multiple tests per file are allowed simply by adding more objects to the "tests" list.
Each parse error is an object that contains error code
and one-based
error location indices: line
and col
.
description
, input
and output
are always present. The other values
are optional.
test.input
is a string containing the characters to pass to the
tokenizer. Specifically, it represents the characters of the input
stream, and so implementations are expected to perform the processing
described in the spec's Preprocessing the input stream section
before feeding the result to the tokenizer.
If test.doubleEscaped
is present and true
, then test.input
is not
quite as described above. Instead, it must first be subjected to another
round of unescaping (i.e., in addition to any unescaping involved in the
JSON import), and the result of that represents the characters of the
input stream. Currently, the only unescaping required by this option is
to convert each sequence of the form \uHHHH (where H is a hex digit)
into the corresponding Unicode code point. (Note that this option also
affects the interpretation of test.output
.)
test.initialStates
is a list of strings, each being the name of a
tokenizer state which can be one of the following:
Data state
PLAINTEXT state
RCDATA state
RAWTEXT state
Script data state
CDATA section state
The test should be run once for each string, using it
to set the tokenizer's initial state for that run. If
test.initialStates
is omitted, it defaults to ["Data state"]
.
test.lastStartTag
is a lowercase string that should be used as "the
tag name of the last start tag to have been emitted from this
tokenizer", referenced in the spec's definition of appropriate end tag
token. If it is omitted, it is treated as if "no start tag has been
emitted from this tokenizer".
test.output
is a list of tokens, ordered with the first produced by
the tokenizer the first (leftmost) in the list. The list must mach the
complete list of tokens that the tokenizer should produce. Valid
tokens are:
["DOCTYPE", name, public_id, system_id, correctness]
["StartTag", name, {attributes}*, true*]
["StartTag", name, {attributes}]
["EndTag", name]
["Comment", data]
["Character", data]
public_id
and system_id
are either strings or null
. correctness
is either true
or false
; true
corresponds to the force-quirks flag
being false, and vice-versa.
When the self-closing flag is set, the StartTag
array has true
as
its fourth entry. When the flag is not set, the array has only three
entries for backwards compatibility.
All adjacent character tokens are coalesced into a single
["Character", data]
token.
If test.doubleEscaped
is present and true
, then every string within
test.output
must be further unescaped (as described above) before
comparing with the tokenizer's output.
tokenizer/xmlViolation.test
differs from the above in a couple of
ways:
- The name of the single member of the top-level JSON object is "xmlViolationTests" instead of "tests".
- Each test's expected output assumes that implementation is applying the tweaks given in the spec's "Coercing an HTML DOM into an infoset" section.