expose WTF-8 converters on all platforms
mflatt committed Dec 25, 2020
1 parent dd1061f commit 4d7f52e
Showing 8 changed files with 215 additions and 181 deletions.
41 changes: 30 additions & 11 deletions pkgs/racket-doc/scribblings/reference/bytes.scrbl
@@ -453,30 +453,44 @@ Certain encoding combinations are always available:
 @item{@racket[(bytes-open-converter "platform-UTF-8" "platform-UTF-16")]
 --- converts UTF-8 to UTF-16 on @|AllUnix|, where each UTF-16
 code unit is a sequence of two bytes ordered by the current
-platform's endianness. On Windows, the input can include
-encodings that are not valid UTF-8, but which naturally extend the
-UTF-8 encoding to support unpaired surrogate code units, and the
-output is a sequence of UTF-16 code units (as little-endian byte
-pairs), potentially including unpaired surrogates.}
+platform's endianness. On Windows, the conversion is the same
+as @racket[(bytes-open-converter "WTF-8" "WTF-16")] to support
+unpaired surrogate code units.}

 @item{@racket[(bytes-open-converter "platform-UTF-8-permissive" "platform-UTF-16")]
 --- like @racket[(bytes-open-converter "platform-UTF-8" "platform-UTF-16")],
 but an input byte that is not part of a valid UTF-8 encoding
 sequence (or valid for the unpaired-surrogate extension on
-Windows) is effectively replaced with @racket[(char->integer #\?)].}
+Windows) is effectively replaced with @racketvalfont{#\uFFFD}.}

 @item{@racket[(bytes-open-converter "platform-UTF-16" "platform-UTF-8")]
 --- converts UTF-16 (bytes ordered by the current platform's
-endianness) to UTF-8 on @|AllUnix|. On Windows, the input can
-include UTF-16 code units that are unpaired surrogates, and the
-corresponding output includes an encoding of each surrogate in a
-natural extension of UTF-8. On @|AllUnix|, surrogates are
+endianness) to UTF-8 on @|AllUnix|. On Windows, the conversion
+is the same as @racket[(bytes-open-converter "WTF-16" "WTF-8")]
+to support unpaired surrogates. On @|AllUnix|, surrogates are
 assumed to be paired: a pair of bytes with the bits @code{#xD800}
 starts a surrogate pair, and the @code{#x03FF} bits are used from
 the pair and following pair (independent of the value of the
 @code{#xDC00} bits). On all platforms, performance may be poor
 when decoding from an odd offset within an input byte string.}

+@item{@racket[(bytes-open-converter "WTF-8" "WTF-16")]
+--- converts the WTF-8 @cite["Sapin18"] superset of UTF-8 to a
+superset of UTF-16 to support unpaired surrogate code units, where
+each UTF-16 code unit is a sequence of two bytes ordered by the
+current platform's endianness.}
+
+@item{@racket[(bytes-open-converter "WTF-8-permissive" "WTF-16")]
+--- like @racket[(bytes-open-converter "WTF-8" "WTF-16")],
+but an input byte that is not part of a valid WTF-8 encoding
+sequence is effectively replaced with @racketvalfont{#\uFFFD}.}
+
+@item{@racket[(bytes-open-converter "WTF-16" "WTF-8")]
+--- converts the WTF-16 @cite["Sapin18"] superset of UTF-16 to the
+WTF-8 superset of UTF-8. The input can include UTF-16 code units
+that are unpaired surrogates, and the corresponding output includes
+an encoding of each surrogate in a natural extension of UTF-8.}
+
 ]

 A newly opened byte converter is registered with the current custodian
@@ -501,7 +515,12 @@ current executable's directory at run time, and the DLL must either
 supply @tt{_errno} or link to @filepath{msvcrt.dll} for @tt{_errno};
 otherwise, only the guaranteed combinations are available.

-Use @racket[bytes-convert] with the result to convert byte strings.}
+Use @racket[bytes-convert] with the result to convert byte strings.
+
+@history[#:changed "7.9.0.17" @elem{Added built-in converters for
+@racket["WTF-8"],
+@racket["WTF-8-permissive"], and
+@racket["WTF-16"].}]}


 @defproc[(bytes-close-converter [converter bytes-converter?]) void]{
Expand Down
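
The documented behavior of the new converters can be tried directly. The following is a minimal sketch, assuming Racket 7.9.0.17 or later (the version named in the `@history` entry above); the variable names are illustrative, and the expected values in the comments are taken from the `unicode.rktl` tests in this commit rather than from an independent run.

```racket
#lang racket/base

;; Assumption: Racket 7.9.0.17 or later, where the "WTF-16" -> "WTF-8"
;; converter is built in on all platforms.
(define c (bytes-open-converter "WTF-16" "WTF-8"))

;; An unpaired high surrogate at the end of the input is not converted:
;; a following low surrogate could still complete the pair.
(define lone-high (integer->integer-bytes #xD8FF 2 #f))
(define-values (out-high used-high status-high)
  (bytes-convert c lone-high))

;; An unpaired low surrogate is encoded using the WTF-8 extension.
;; integer->integer-bytes defaults to the platform's byte order,
;; which is what the converter expects for each 16-bit code unit.
(define lone-low (integer->integer-bytes #xDFFF 2 #f))
(define-values (out-low used-low status-low)
  (bytes-convert c lone-low))

(bytes-close-converter c)

(list out-high used-high status-high) ; tests expect: #"" 0 aborts
(list out-low used-low status-low)    ; tests expect: #"\355\277\277" 2 complete
```

Before this commit, the same results were only expected from `"platform-UTF-16"` to `"platform-UTF-8"` on Windows; on other platforms that converter assumes surrogates are paired.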
6 changes: 6 additions & 0 deletions pkgs/racket-doc/scribblings/reference/reference.scrbl
@@ -209,6 +209,12 @@ The @racketmodname[racket] library combines
 #:url "https://doi.org/10.1017/CBO9780511574962"
 #:date "1999")

+(bib-entry #:key "Sapin18"
+#:author "Simon Sapin"
+#:title "The WTF-8 Encoding"
+#:url "http://simonsapin.github.io/wtf-8/"
+#:date "2018")
+
 (bib-entry #:key "Shan04"
 #:author "Ken Shan"
 #:title "Shift to Control"
69 changes: 39 additions & 30 deletions pkgs/racket-test-core/tests/racket/unicode.rktl
@@ -889,11 +889,14 @@
 (go (lambda (n p) (read-n n p 1)))
 (go (lambda (n p) (read-n n p 2))))))
 ;; Test UTF-16
-(let ([c (bytes-open-converter "platform-UTF-8" "platform-UTF-16")])
+(for ([c (list (bytes-open-converter "platform-UTF-8" "platform-UTF-16")
+(bytes-open-converter "WTF-8" "WTF-16"))]
+[wtf? (list (eq? 'windows (system-type))
+#t)])
 (let-values ([(s2 n status) (bytes-convert c s)])
 (case parse-status
 [(surrogate1 surrogate2)
-(if (eq? (system-type) 'windows)
+(if wtf?
 (begin
 (if (eq? parse-status 'surrogate1)
 (test 'aborts 'status status)
Expand Down Expand Up @@ -975,20 +978,23 @@
basic-utf-8-tests))

;; Further UTF-16 tests
(let ([c (bytes-open-converter "platform-UTF-16" "platform-UTF-8")])
(for ([c (list (bytes-open-converter "platform-UTF-16" "platform-UTF-8")
(bytes-open-converter "WTF-16" "WTF-8"))]
[wtf? (list (eq? 'windows (system-type))
#t)])
(let-values ([(s n status) (bytes-convert c (bytes-append
(integer->integer-bytes #xD800 2 #f)
(integer->integer-bytes #xDC00 2 #f)))])
(test-values (list #"" 0 'aborts)
(lambda () (bytes-convert c (integer->integer-bytes #xD800 2 #f) )))
;; Windows: unpaired surrogates allowed:
(when (eq? 'windows (system-type))
;; WTF: unpaired surrogates allowed:
(when wtf?
(test-values (list #"" 0 'aborts)
(lambda () (bytes-convert c (integer->integer-bytes #xD8FF 2 #f))))
(test-values (list #"\355\277\277" 2 'complete)
(lambda () (bytes-convert c (integer->integer-bytes #xDFFF 2 #f)))))
;; Non-windows: after #xD800 bits, surrogate pair is assumed
(unless (eq? 'windows (system-type))
;; UTF: after #xD800 bits, surrogate pair is assumed
(unless wtf?
(test-values (list #"" 0 'aborts)
(lambda () (bytes-convert c (integer->integer-bytes #xD800 2 #f))))
(test-values (list #"" 0 'aborts)
Expand Down Expand Up @@ -1027,29 +1033,32 @@
(test-values '(#"" complete)
(lambda () (bytes-convert-end c))))

(when (eq? (system-type) 'windows)
(let ([c (bytes-open-converter "platform-UTF-8-permissive" "platform-UTF-16")])
;; Check that we use all 6 bytes of #"\355\240\200\355\260\200" or none
(test-values (list 12 6 'complete)
(lambda ()
(bytes-convert c #"\355\240\200\355\260\200" 0 6 (make-bytes 12))))
;; If we can't look all the way to the end, reliably abort without writing:
(let ([s (make-bytes 12 (char->integer #\x))])
(let loop ([n 1])
(unless (= n 6)
(test-values (list 0 0 'aborts)
(lambda ()
(bytes-convert c #"\355\240\200\355\260\200" 0 n s)))
(test #"xxxxxxxxxxxx" values s) ; no writes to bytes string
(loop (add1 n)))))
(let ([s (make-bytes 12 (char->integer #\x))])
(let loop ([n 0])
(unless (= n 12)
(test-values (list 0 0 'continues)
(lambda ()
(bytes-convert c #"\355\240\200\355\260\200" 0 6 (make-bytes n))))
(test #"xxxxxxxxxxxx" values s) ; no writes to bytes string
(loop (add1 n)))))))
(for ([c (append
(if (eq? (system-type) 'windows)
(list (bytes-open-converter "platform-UTF-8-permissive" "platform-UTF-16"))
null)
(list (bytes-open-converter "WTF-8-permissive" "WTF-16")))])
;; Check that we use all 6 bytes of #"\355\240\200\355\260\200" or none
(test-values (list 12 6 'complete)
(lambda ()
(bytes-convert c #"\355\240\200\355\260\200" 0 6 (make-bytes 12))))
;; If we can't look all the way to the end, reliably abort without writing:
(let ([s (make-bytes 12 (char->integer #\x))])
(let loop ([n 1])
(unless (= n 6)
(test-values (list 0 0 'aborts)
(lambda ()
(bytes-convert c #"\355\240\200\355\260\200" 0 n s)))
(test #"xxxxxxxxxxxx" values s) ; no writes to bytes string
(loop (add1 n)))))
(let ([s (make-bytes 12 (char->integer #\x))])
(let loop ([n 0])
(unless (= n 12)
(test-values (list 0 0 'continues)
(lambda ()
(bytes-convert c #"\355\240\200\355\260\200" 0 6 (make-bytes n))))
(test #"xxxxxxxxxxxx" values s) ; no writes to bytes string
(loop (add1 n))))))

;; Seems like this sort of thing should be covered above, and maybe it
;; it after some other corrections. But just in case: