Dan responded in this thread:
> >So: since Unicode has adopted an expansion mechanism that allows only 10FFFF
> >characters and since there will never, ever, be any data encoded outside
> >that range (we have all been assured), it is IMHO a good idea to reflect
> >that fact in your UTF-8 implementation. It is too late to levitate out of
> >the corner we are painted into. Building systems that prevent improper usage
> >is a good data-quality check.
> Just because Unicode have decided to have UTF-16 for their 16-bit mode
> does not guarantee that the range will never be expanded (well it might
> not be called Unicode), ISO 10646 need not forever have this restriction.
> So plan for a possible future.
We *are* busy planning for a possible future. But the possible future
is not quite as you envision it.
ISO/IEC 10646 currently does not have any *formal* restriction on the
encoding of characters beyond U-0010FFFF. This is currently just a matter of
practical agreement among the committees doing the encoding (UTC and WG2).
However, a very minor fix to the normative clauses of 10646 could *normatively*
guarantee that no characters would be encoded past U-0010FFFF,
and eliminate the (informally deprecated) private use planes past U-0010FFFF,
thereby guaranteeing that UTF-16 could be used to express any character
represented in UTF-8 or UCS-4.
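To make the arithmetic behind that limit concrete, here is a minimal
sketch of the surrogate-pair mapping. The function names are
illustrative only and are not taken from any specification text; the
point is that the highest high/low surrogate pair lands exactly on
0x10FFFF, so nothing beyond that value is reachable from UTF-16.

```python
# Surrogate ranges used by UTF-16: high D800..DBFF, low DC00..DFFF.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def to_surrogates(cp):
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not expressible as a surrogate pair: %#x" % cp)
    v = cp - 0x10000                  # 20-bit value
    return HIGH_START + (v >> 10), LOW_START + (v & 0x3FF)


def from_surrogates(high, low):
    """Recombine a surrogate pair into a code point."""
    if not (HIGH_START <= high <= HIGH_END and LOW_START <= low <= LOW_END):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# The largest pair gives the largest code point UTF-16 can express.
MAX_UTF16 = from_surrogates(HIGH_END, LOW_END)
print(hex(MAX_UTF16))
```

Each surrogate carries 10 bits, so a pair covers 2^20 values starting at
0x10000, which is where the 0x10FFFF ceiling comes from.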
The UTC is currently working on wording for such a change, to be proposed
as a DCOR for 10646. I believe it would be in everyone's interest to
nail down this loose end in the synchronization between Unicode and 10646.
> As I as a programmer will either handle my characters in 8, 16 or 32-bit
> words I can see no reason to place a restriction on UTF-8.
> If I use 16-bit words I will only use that range. I will never use UTF-16
> inside a program.
It is, of course, anyone's prerogative to use whichever encoding form
of 10646 best serves their programming interests.
However, if the proposed DCOR is successful, no 5- or 6-byte form of
UTF-8 will ever refer to an encoded character. Hence the specification
of valid UTF-8 in terms of the 1-, 2-, 3-, and 4-byte forms will be
a formally valid shortcut that programs can make. This is *also* in
everyone's interests, as it simplifies convertors, decreases the
expansion buffer size requirements for worst-case conversions and
storage allocations, and so on.
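As a rough illustration of the buffer-size point, the sketch below
computes how many bytes a code point needs in UTF-8, first with the
range capped at 0x10FFFF and then with the full 31-bit UCS-4 range
that the 5- and 6-byte forms were designed for. The names utf8_length,
worst_case_bytes and UCS4_MAX are my own, chosen for the example, and
the sketch ignores the surrogate range for brevity.

```python
UNICODE_MAX = 0x10FFFF     # ceiling under the proposed restriction
UCS4_MAX = 0x7FFFFFFF      # ceiling of the original 31-bit UCS-4 space


def utf8_length(cp, max_cp=UNICODE_MAX):
    """Number of bytes UTF-8 needs for code point cp, up to max_cp."""
    if cp < 0 or cp > max_cp:
        raise ValueError("code point out of range: %#x" % cp)
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    if cp < 0x200000:
        return 4
    if cp < 0x4000000:
        return 5           # only reachable when max_cp exceeds 0x10FFFF
    return 6               # likewise


def worst_case_bytes(n_chars, max_cp=UNICODE_MAX):
    """Buffer size that is always enough for n_chars code points."""
    return n_chars * utf8_length(max_cp, max_cp)


print(worst_case_bytes(100))                   # capped at 0x10FFFF
print(worst_case_bytes(100, max_cp=UCS4_MAX))  # full UCS-4 range
```

With the cap in place, a converter can size its output at four bytes
per character and treat any 5- or 6-byte lead byte as an error, rather
than carrying decode paths that no encoded character will ever use.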