Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored? from Philippe Verdy via Unicode on 2017-07-24 (Unicode Mail List Archive)

From: Philippe Verdy via Unicode <unicode_at_unicode.org>
Date: Tue, 25 Jul 2017 01:52:09 +0200

2017-07-25 0:35 GMT+02:00 Doug Ewell via Unicode <unicode_at_unicode.org>:

> J Decker wrote:
>
> > I generally accepted any utf-8 encoding up to 31 bits though ( since
> > I was going from the original spec, and not what was effective limit
> > based on unicode codepoint space)
>
> Hey, everybody: Don't do that.
>
> UTF-8 has been constrained to the Unicode code space (maximum U+10FFFF,
> four bytes) for almost fourteen years now.

I fully agree. This is now an essential part of UTF-8 that has helped
secure it (including against dangerous unbounded loops scanning through
buffers in memory), and it has also helped improve performance: when you
unroll loops that you no longer need to count separately, the code
expansion is not so large that you can't do correct branch prediction, and
you can benefit from caching in code. Due to the way the UCS code space is
allocated and how code points are used, the branches in your code have very
distinctive patterns that are easy to enumerate; test coverage for those
branches is possible without exploding combinatorially, which eliminates
the need for heuristics.
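
To make the point about enumerable branches concrete, here is a minimal
sketch of a strict decoder in Python. The function name decode_utf8_strict
is my own, and this is only one way to lay out the checks, assuming the
well-formed byte ranges of RFC 3629 (lead bytes C2..F4, at most three
trailing bytes, no overlong forms, no surrogates, nothing above U+10FFFF).

```python
def decode_utf8_strict(data: bytes) -> list:
    """Decode bytes into a list of code points, accepting only RFC 3629 UTF-8.

    Hypothetical helper for illustration; not taken from any library.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:                      # 1 byte: U+0000..U+007F
            out.append(b0)
            i += 1
            continue
        # lo/hi bound the FIRST trailing byte only; they are what exclude
        # overlong forms, surrogates and values above U+10FFFF.
        if 0xC2 <= b0 <= 0xDF:             # 2 bytes (C0, C1 would be overlong)
            need, cp, lo, hi = 1, b0 & 0x1F, 0x80, 0xBF
        elif 0xE0 <= b0 <= 0xEF:           # 3 bytes
            need, cp = 2, b0 & 0x0F
            lo = 0xA0 if b0 == 0xE0 else 0x80   # E0 80..9F would be overlong
            hi = 0x9F if b0 == 0xED else 0xBF   # ED A0..BF would be surrogates
        elif 0xF0 <= b0 <= 0xF4:           # 4 bytes
            need, cp = 3, b0 & 0x07
            lo = 0x90 if b0 == 0xF0 else 0x80   # F0 80..8F would be overlong
            hi = 0x8F if b0 == 0xF4 else 0xBF   # F4 90.. would exceed U+10FFFF
        else:
            # 80..BF (stray trail), C0, C1, and F5..FF, which covers the
            # 5- and 6-byte lead bytes of the old 31-bit scheme.
            raise ValueError(f"invalid lead byte 0x{b0:02X} at offset {i}")
        if i + need >= n:
            raise ValueError(f"truncated sequence at offset {i}")
        for k in range(1, need + 1):       # bounded: at most 3 iterations
            b = data[i + k]
            if not lo <= b <= hi:
                raise ValueError(f"invalid byte 0x{b:02X} at offset {i + k}")
            cp = (cp << 6) | (b & 0x3F)
            lo, hi = 0x80, 0xBF            # later trailing bytes: plain range
        out.append(cp)
        i += need + 1
    return out
```

With this sketch, decode_utf8_strict(b"\xf4\x8f\xbf\xbf") returns [0x10FFFF],
while b"\xf4\x90\x80\x80" (one past the limit), b"\xf8\x88\x80\x80\x80" (an
old-style 5-byte form), b"\xc0\x80" (overlong) and b"\xed\xa0\x80" (a
surrogate) all raise ValueError. The whole set of branches fits in a handful
of byte ranges, so each one can be covered by a test.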

And about the RFC we were discussing, it is rather recent compared to the
approved stabilization of UTF-8 and finally its endorsement by the
industry. UTF-8 is strictly bound to at most 4 bytes per code point and
nothing more. This allows other things to be developed on top of this fact,
which is used now as a checked assumption that cannot be broken except by
software bugs, and those bugs will soon create security problems once the
assumption is no longer checked throughout a processing chain.

The old RFC was not "UTF-8" (even if that name was proposed, it was not
really assigned) but an early proposal under discussion that did not reach
the level of standard or best practice. It was experimental, and at that
time there were several other candidates (including UTF-7, which is now
almost abandoned, and BOCU-1, which is now marginal but was also bound to
the 17-plane limit). The encoding in the old RFC should just be given
another name; it is not used for encoding only text, and in fact it
described a binary format. (For a generic variable-length binary encoding
of numbers there are now better candidates, which are not limited to just
31 bits or even to unsigned integers, and which are also faster to process,
more compact, and have more interesting properties for code analysis and
for resistance to encoding and transmission/storage errors.)

In the IANA database for charsets, the old RFC encoding has a separate
identifier, but "UTF-8" refers to RFC 3629 (IETF standard 63); the former
proposals in RFC 2279 or RFC 2044 have never been approved standards, but
just drafts mapped in IANA as the obsolete "UNICODE-1-1-UTF-8" (retired
later as it was never approved by Unicode).

The only remaining "charset" in the IANA database that refers to 31-bit
code points is "ISO-10646-UCS-4", but it does not use a variable-length
encoding and does not specify any byte order. It is just a basic subtype
for a range of positive integers, without any restriction of use and not
necessarily representing text. It is a very inefficient way to encode them,
only meant as an internal temporary transform in transient memory or CPU
registers (at least for 32-bit CPUs or wider, which is now almost always
the case even in embedded systems, as 4-, 8- or 16-bit CPUs are almost dead
or will not be used for international text processing; even the simplest
keyboard controllers, which manage ~100-150 keys and a few LEDs and report
at 1 kHz for the fastest ones, now use 32-bit CPUs internally).
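
A small Python sketch of what the missing byte order means in practice. The
helper name ucs4_bytes is hypothetical, and I am assuming a plain 32-bit
unsigned integer per code point, written with the standard struct module:

```python
import struct

def ucs4_bytes(code_point: int, big_endian: bool) -> bytes:
    """Serialize one 31-bit value as 4 bytes in the requested byte order.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    return struct.pack(">I" if big_endian else "<I", code_point)

# The same value gives two different byte strings, and nothing in the
# bytes themselves says which order was used.
print(ucs4_bytes(0x1F600, True).hex())    # 0001f600
print(ucs4_bytes(0x1F600, False).hex())   # 00f60100
# Every value costs 4 bytes, even plain ASCII.
print(len(ucs4_bytes(0x41, True)))        # 4
```
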
Received on Mon Jul 24 2017 - 18:52:55 CDT
