author | teor <teor2345@gmail.com> | 2018-06-22 10:04:42 +1000 |
---|---|---|
committer | teor <teor2345@gmail.com> | 2018-06-22 10:04:42 +1000 |
commit | 436bb125540177d6c22193ae1f13580d826dc003 (patch) | |
tree | 1ce0ad4d33e87ad40da7c90b5d72db0bf00179e1 /proposals/285-utf-8.txt | |
parent | 4df184021b7c84cc47e2ed19a601b1e790b5b4fb (diff) | |
Rewrite the UTF-8 specification in prop#285 so it is more specific
Use terminology from The Unicode Standard.
Ban byte-swapped byte order marks.
Add references to The Unicode Standard.
Diffstat (limited to 'proposals/285-utf-8.txt')
-rw-r--r-- | proposals/285-utf-8.txt | 51 |
1 files changed, 43 insertions, 8 deletions
diff --git a/proposals/285-utf-8.txt b/proposals/285-utf-8.txt
index 6521e03..702a972 100644
--- a/proposals/285-utf-8.txt
+++ b/proposals/285-utf-8.txt
@@ -70,11 +70,46 @@ Status: Open
 2.3. Which UTF-8 exactly?

    We define the allowable set of UTF-8 as:
-     * Encoding the codepoints U+01 through U+10FFFF,
-     * but excluding the codepoints U+D800 through U+DFFF,
-     * each encoded with the shortest possible encoding.
-     * without any BOM.
-
-
-
-
+     * Zero or mode Unicode scalar values (as defined by The Unicode
+       Standard, Version 3.1 or later), that is:
+         * Unicode code points U+00 through U+10FFFF,
+         * but excluding the code points U+D800 through U+DFFF,
+     * Excluding the scalar value U+00 (for compatibility with NUL-terminated
+       C strings),
+     * Serialized using the UTF-8 encoding scheme (as defined by The Unicode
+       Standard, Version 3.1 or later), in particular:
+         * each code point is encoded with the shortest possible encoding,
+     * Without a Unicode byte order mark (BOM, U+FEFF) at the start of the
+       descriptor. (BOMs are optional and not recommended in UTF-8. Allowing
+       a BOM would break backwards compatibility with ASCII-only Tor
+       implementations.) Byte-swapped BOMs (U+FFFE) must also be rejected.
+
+   In order to remain compatible with future versions of The Unicode Standard,
+   we allow all possible code points, including Reserved code points.
+
+   For languages with a conforming UTF-8 implementation (as defined by The
+   Unicode Standard, Version 3.1 or later), this is equivalent to well-formed
+   UTF-8, with the following additional rules:
+     * reject a BOM (U+FEFF) or byte-swapped BOM (U+FFFE) at the start of the
+       descriptor,
+     * reject U+00 at any point in the descriptor,
+     * accept all code point types used in UTF-8, including Control,
+       Private-Use, Noncharacter, and Reserved. (The Surrogate code point type
+       is not used in UTF-8.)
+
+   For languages without a conforming UTF-8 implementation, we recommend
+   checking UTF-8 conformity based on the "Well-Formed UTF-8 Byte Sequences"
+   table from The Unicode Standard, Version 11 (or later).
+
+   Note that U+00 is serialized to 0x00, but U+FEFF is serialized to 0xEFBBBF,
+   and U+FFFE is serialized to 0xEFBFBE.
+
+3. References
+
+   The Unicode Standard, Version 11, Chapter 3.
+   In particular:
+     * Unicode scalar values: D76, page 120.
+     * UTF-8 encoding form: D92, pages 125-127.
+     * Well-Formed UTF-8 Byte Sequences: Table 3-7, page 126.
+     * Byte order mark: C11, page 83; D94, page 130.
+     * UTF-8 encoding scheme: D96, pages 130.
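The "conforming UTF-8 implementation" path described in the patch can be sketched in a few lines. This is only an illustration, not code from Tor or from the proposal: the function name `is_allowed_utf8` is hypothetical, and it assumes that Python 3's strict `utf-8` codec counts as a conforming implementation (it rejects overlong encodings, surrogates, and values above U+10FFFF, and it does not strip a leading BOM).

```python
def is_allowed_utf8(data: bytes) -> bool:
    """Hypothetical check for the rules listed in the patch above.

    Assumes Python 3's strict "utf-8" codec is a conforming decoder.
    """
    try:
        # Well-formedness: the strict decoder rejects non-shortest forms,
        # surrogates (U+D800..U+DFFF) and anything above U+10FFFF.
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False

    # Additional rule: reject U+00 at any point in the descriptor.
    if "\x00" in text:
        return False

    # Additional rule: reject a BOM (U+FEFF) or byte-swapped BOM (U+FFFE)
    # at the start of the descriptor only.
    if text[:1] in ("\ufeff", "\ufffe"):
        return False

    # Control, Private-Use, Noncharacter and Reserved code points are
    # deliberately accepted, so no further filtering is done here.
    return True
```

An empty input passes (zero scalar values are allowed), while `b"\xef\xbb\xbf..."` and `b"\xef\xbf\xbe..."` are rejected because those are the serialized forms of U+FEFF and U+FFFE noted in the patch. A BOM that appears later in the input is accepted under this reading, since the patch only restricts the start of the descriptor.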