This small, free-standing, public domain library encodes a stream of code points into UTF-7 and decodes UTF-7 back into code points. It requires only a single, small struct for its entire internal state.
Initialize a context struct (struct utf7), set its buf and len fields to describe a buffer for input or output, then "pump" the encoder or decoder, similarly to zlib.
char buffer[256];
struct utf7 ctx;
utf7_init(&ctx, "=@"); /* indirectly encode = and @ */
ctx.buf = buffer;
ctx.len = sizeof(buffer);
The context must be re-initialized before switching between encoding and decoding.
void utf7_init(struct utf7 *, const char *indirect);
The utf7_init() function initializes a context for either encoding or decoding. The indirect argument is optional and may be NULL. It is ignored when the context is used for decoding. By default the encoder directly encodes every character that is permitted to be directly encoded. The indirect argument subtracts from this set of directly-encoded characters. This may be desirable for certain characters, such as = (EQUALS SIGN).
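To illustrate how an indirect set narrows the direct set, here is a minimal sketch. The helper is_direct() is hypothetical and not part of this library. It assumes the default direct set is RFC 2152 Set D, Set O, and space, tab, CR, and LF, which this document does not confirm, so treat the exact membership as an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the library. Returns 1 if the
 * character c would be written directly and 0 if it would be encoded
 * indirectly. ASSUMPTION: the default direct set is RFC 2152 Set D,
 * Set O, and space/tab/CR/LF. */
int
is_direct(int c, const char *indirect)
{
    static const char direct[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789"
        "'(),-./:?"                /* remainder of Set D */
        "!\"#$%&*;<=>@[]^_`{|}"    /* Set O */
        " \t\r\n";                 /* permitted whitespace */

    /* Reject NUL and non-ASCII before strchr(), which would
     * otherwise match the string terminator for c == 0. */
    if (c <= 0 || c > 127 || !strchr(direct, c))
        return 0;
    /* Characters named in the indirect set are subtracted. */
    if (indirect && strchr(indirect, c))
        return 0;
    return 1;
}
```

Under these assumptions, is_direct('=', NULL) is 1 while is_direct('=', "=@") is 0, mirroring the effect of passing "=@" to utf7_init().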
int utf7_encode(struct utf7 *, long codepoint);
The utf7_encode() function writes a code point to the buffer pointed to by the context. The buf and len fields are updated on the context as output is produced. Code points outside the Basic Multilingual Plane (BMP) are automatically encoded as surrogate halves for UTF-7.
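For reference, the standard UTF-16 arithmetic for splitting a non-BMP code point into surrogate halves can be sketched as follows. The helper split_surrogates() is hypothetical and only illustrates the calculation. It is not the library's internal code, which may be organized differently.

```c
#include <assert.h>

/* Hypothetical helper, not part of the library. Splits a code point
 * in the range U+10000..U+10FFFF into a UTF-16 surrogate pair. */
void
split_surrogates(long cp, unsigned *hi, unsigned *lo)
{
    assert(cp >= 0x10000L && cp <= 0x10FFFFL);
    cp -= 0x10000L;                         /* 20 significant bits remain */
    *hi = 0xD800u + (unsigned)(cp >> 10);   /* upper 10 bits */
    *lo = 0xDC00u + (unsigned)(cp & 0x3FF); /* lower 10 bits */
}
```

For example, U+1F600 splits into the halves 0xD83D and 0xDE00.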
When there is nothing more to encode, call the encoder with the special UTF7_FLUSH code point to force all remaining output from the context. This behaves just like any other code point, particularly with respect to the return values below, except that UTF7_FLUSH itself is never written to the output. After flushing, the context is effectively reinitialized.
There are two possible return values:
- UTF7_OK: The operation completed successfully.
- UTF7_FULL: The output buffer filled up before the operation could be completed. Consume the output buffer as appropriate for your application, update the context's buf and len to a fresh buffer, and continue the operation by calling it again with the exact same arguments.
long utf7_decode(struct utf7 *);
This function operates in reverse, consuming input from buf on the context and returning a code point. Surrogate halves in the underlying stream are automatically recombined into a non-BMP code point.
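The recombination is the inverse of the standard UTF-16 surrogate split. A minimal sketch, using a hypothetical helper join_surrogates() that is not part of this library and whose -1 error value is purely illustrative:

```c
/* Hypothetical helper, not part of the library. Combines a high and
 * a low surrogate into a code point in U+10000..U+10FFFF. Returns -1
 * if the arguments are not a valid high/low pair. */
long
join_surrogates(unsigned hi, unsigned lo)
{
    if (hi < 0xD800u || hi > 0xDBFFu)
        return -1;                      /* not a high surrogate */
    if (lo < 0xDC00u || lo > 0xDFFFu)
        return -1;                      /* not a low surrogate */
    return 0x10000L
         + ((long)(hi - 0xD800u) << 10) /* upper 10 bits */
         + (long)(lo - 0xDC00u);        /* lower 10 bits */
}
```

For example, the halves 0xD83D and 0xDE00 combine into U+1F600, while passing them in the wrong order yields -1.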
There are four possible return values:
- UTF7_OK: Input was exhausted, but this is a valid ending for a stream.
- UTF7_INCOMPLETE: Input was exhausted but more input is expected. If there is no more input, this should be treated as an error since the input was truncated.
- UTF7_INVALID: The input is not valid UTF-7. The offending byte is pointed to by buf.
- Any other return value is a code point.
Under tests/ is a simple command line tool called conv7 that converts between UTF-7 and other encodings via standard input and standard output. For example, to convert a UTF-8 file to UTF-7:
$ conv7 -f utf-8 <in-u8.txt >out-u7.txt
Or vice versa:
$ conv7 -t utf-8 <in-u7.txt >out-u8.txt