Re: Question about Perl5 extended UTF-8 design

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Thu, 5 Nov 2015 18:25:05 +0100

An extended sequence like that won't represent any valid Unicode code point
(no standard scalar value is defined for it), so if you use those leading
bytes, don't pretend the result is "UTF-8". It is not even "modified UTF-8",
the variant Java created for its internal serialization of unrestricted
16-bit strings, which allows lone surrogates and represents U+0000 as
<0xC0,0x80> instead of the <0x00> of standard UTF-8. You'll have to create
your own charset identifier (e.g. "perl5-UTF-8-extended" or some name
derived from your Perl5 library) and say it is not for use in interchange
of standard text.
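
To make the difference from standard UTF-8 concrete, here is a minimal
Python sketch. The helper names (encode_modified_utf8_nul,
is_standard_utf8) are invented for illustration, and the helper covers
only the U+0000 difference described above, not the surrogate handling.

```python
def encode_modified_utf8_nul(text: str) -> bytes:
    """Hypothetical helper: standard UTF-8, except U+0000 becomes C0 80.

    Only the NUL difference is modeled here. In UTF-8 the byte 0x00 can
    only come from U+0000, so a byte-level replace is safe.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def is_standard_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


standard = "a\x00b".encode("utf-8")          # NUL stays a single 0x00 byte
modified = encode_modified_utf8_nul("a\x00b")  # NUL becomes C0 80

print(standard, is_standard_utf8(standard))
print(modified, is_standard_utf8(modified))

# A lone surrogate has no standard UTF-8 encoding either.
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate rejected by the standard encoder")
```

A strict decoder refuses the C0 80 form, which is why such data needs its
own charset label rather than "UTF-8".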

The extra code points you'll get are then necessarily for private use (but
still not part of the standard PUA set), and have absolutely no defined
properties from the standard. They should not be used to represent any
Unicode character or character sequence. In any API taking some text input,
those code points will never be decoded and will behave on input like
encoding errors.
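
As a rough illustration of "behave on input like encoding errors", the
sketch below feeds a made-up 13-byte sequence (0xFF followed by 12
continuation bytes) to Python's standard UTF-8 decoder. The byte pattern
is only an assumed shape for the demonstration and is not claimed to be a
real perl5 encoding; the function names are likewise hypothetical.

```python
from typing import Optional


def decode_strict(data: bytes) -> Optional[str]:
    """Strict decoding: return None when the input is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def decode_lenient(data: bytes) -> str:
    """Lenient decoding: invalid bytes are replaced with U+FFFD."""
    return data.decode("utf-8", errors="replace")


# Assumed shape of an extended sequence, wrapped in ordinary ASCII.
extended = b"A" + b"\xff" + b"\x80" * 12 + b"B"

print(decode_strict(extended))          # None: the input is rejected
print(repr(decode_lenient(extended)))   # 'A', replacement characters, 'B'
```

Either way, no extra code point ever reaches the application: the input is
rejected outright or degraded to replacement characters.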

But these extra code points could be used to represent something else, such
as unique object identifiers for internal use in your application, virtual
object pointers, shared memory block handles, file/pipe/stream I/O handles,
service/API handles, user ids, security tokens, 64-bit content hashes plus
some binary flags, placeholders/references for members of an external
unencoded collection or for URIs, internal glyph ids when converting text
for rendering with one or more fonts, or some internal serialization of
geometric shapes/colors/styles/visual effects...

In standard UTF-8 those extra byte values are not "reserved" but
permanently assigned to be "invalid", and there are no valid encoded
sequences as long as 12 or 13 bytes; the longest valid sequence is 4 bytes.
(0xFF was reserved only in the old RFC version of UTF-8, when it allowed
code points up to 31 bits, but even this RFC is obsolete, should no longer
be used, and has never been approved by Unicode.)
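
The sequence lengths under discussion can be checked with a little
arithmetic. The sketch below assumes, consistently with the figures in the
quoted message, that the 0xFF lead byte carries no payload bits and that
each continuation byte carries 6. The function name payload_bits is
invented for illustration.

```python
def payload_bits(total_bytes: int) -> int:
    """Payload bits of a sequence whose lead byte carries none.

    Assumption: every byte after the lead is a continuation byte of the
    form 10xxxxxx, contributing 6 bits.
    """
    return 6 * (total_bytes - 1)


for n in (12, 13):
    bits = payload_bits(n)
    print(f"{n} bytes -> {bits} bits, max code point 2**{bits} - 1")

# By contrast, standard UTF-8 needs at most 4 bytes, for U+10FFFF.
print(len("\U0010FFFF".encode("utf-8")))
```

Under that assumption 12 bytes already cover a 64-bit word, which is the
point the quoted question turns on.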

2015-11-05 16:57 GMT+01:00 Karl Williamson <public_at_khwilliamson.com>:

> Hi,
>
> Several of us are wondering about the reason for reserving bits for the
> extended UTF-8 in perl5. I'm asking you because you are the apparent
> author of the commits that did this.
>
> To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes the
> length of the sequence of bytes that comprise a single character to be 13
> bytes. This allows code points up to 2**72 - 1 to be represented. If the
> length had been instead 12 bytes, code points up to 2**66 - 1 could be
> represented, which is enough to represent any code point possible in a
> 64-bit word.
>
> The comments indicate that these extra bits are "reserved". So we're
> wondering what potential use you had thought of for these bits.
>
> Thanks
>
> Karl Williamson
>
Received on Thu Nov 05 2015 - 11:26:25 CST
