Re: 8-bit text which is supposed to be UTF-8 but isn't

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jan 31 2000 - 10:51:19 EST


Dan responded in this thread:

> >
> >So: since Unicode has adopted an expansion mechanism that allows only 10FFFF
> >characters and since there will never, ever, be any data encoded outside
> >that range (we have all been assured), it is IMHO a good idea to reflect
> >that fact in your UTF-8 implementation. It is too late to levitate out of
> >the corner we are painted into. Building systems that prevent improper usage
> >is a good data-quality check.
>
> Just because Unicode have decided to have UTF-16 for their 16-bit mode
> does not guarantee that the range will never be expanded (well it might
> not be called Unicode), ISO 10646 need not forever have this restriction.
> So plan for a possible future.

We *are* busy planning for a possible future. But the possible future
is not quite as you envision it.

ISO/IEC 10646 currently does not have any *formal* restriction on the
encoding of characters beyond U-0010FFFF. Staying within that limit is, for
now, just a matter of practical agreement among the committees doing the
encoding (UTC and WG2).

However, a very minor fix to the normative clauses of 10646 could *normatively*
guarantee that no characters would be encoded past U-0010FFFF,
and eliminate the (informally deprecated) private use planes past U-0010FFFF,
thereby guaranteeing that UTF-16 could be used to express any character
represented in UTF-8 or UCS-4.
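
To make the U-0010FFFF boundary concrete, here is a minimal Python sketch of
the UTF-16 surrogate arithmetic. The helper name utf16_code_units is
hypothetical, chosen only for illustration; the point is that two 10-bit
halves added to 0x10000 top out at exactly 0x10FFFF, so anything encoded
past that value has no UTF-16 form.

```python
MAX_UTF16_CODE_POINT = 0x10FFFF  # highest value reachable with one surrogate pair


def utf16_code_units(cp):
    """Return the UTF-16 code units for code point cp (hypothetical helper)."""
    if cp < 0 or cp > MAX_UTF16_CODE_POINT:
        raise ValueError("code point %#x is not expressible in UTF-16" % cp)
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate values are not encoded characters")
    if cp < 0x10000:
        return [cp]  # a single 16-bit code unit
    v = cp - 0x10000  # 20 bits remain, split into two 10-bit halves
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


if __name__ == "__main__":
    for cp in (0x41, 0xFFFD, 0x10000, 0x10FFFF):
        print("%06X" % cp, ["%04X" % u for u in utf16_code_units(cp)])
```

Calling utf16_code_units(0x110000) raises ValueError, which is the gap the
proposed change would close on the 10646 side.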

The UTC is currently working on wording for such a change, to be proposed
as a DCOR for 10646. I believe it would be in everyone's interest to
nail down this loose end in the synchronization between Unicode and 10646.

> As I as a programmer will either handle my characters in 8, 16 or 32-bit
> words I can see no reason to place a restriction on UTF-8.
> If I use 16-bit words I will only use that range. I will never use UTF-16
> inside a program.

It is, of course, anyone's prerogative to use whichever encoding form
of 10646 best serves their programming interests.

However, if the proposed DCOR is successful, no 5- or 6-byte form of
UTF-8 will ever refer to an encoded character. Hence the specification
of valid UTF-8 in terms of the 1-, 2-, 3-, and 4-byte forms will be
a formally valid shortcut that programs can make. This is *also* in
everyone's interests, as it simplifies converters, decreases the
expansion buffer size requirements for worst-case conversions and
storage allocations, and so on.
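
As a rough illustration of that shortcut, the Python sketch below computes
the UTF-8 sequence length for a code point, with and without the 5- and
6-byte forms. The names utf8_length, allow_long_forms, and
worst_case_utf8_bytes are hypothetical, not taken from any library; the
ranges follow the 31-bit UTF-8 definition that the restriction would trim.

```python
def utf8_length(cp, allow_long_forms=False):
    """Number of bytes UTF-8 needs for code point cp (hypothetical helper).

    With allow_long_forms=False, anything past U-0010FFFF is rejected, so
    only the 1-, 2-, 3-, and 4-byte forms can ever occur.
    """
    if cp < 0:
        raise ValueError("negative code point")
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    if cp <= 0x10FFFF:
        return 4
    if not allow_long_forms:
        raise ValueError("code point %#x is past U-0010FFFF" % cp)
    if cp <= 0x1FFFFF:
        return 4
    if cp <= 0x3FFFFFF:
        return 5
    if cp <= 0x7FFFFFFF:
        return 6
    raise ValueError("code point %#x does not fit in 31 bits" % cp)


def worst_case_utf8_bytes(n_chars, allow_long_forms=False):
    """Upper bound on the buffer needed to convert n_chars UCS-4 characters."""
    return n_chars * (6 if allow_long_forms else 4)


if __name__ == "__main__":
    print(worst_case_utf8_bytes(1000))        # restricted: 4 bytes per character
    print(worst_case_utf8_bytes(1000, True))  # unrestricted: 6 bytes per character
```

Under the restriction a converter can size its output buffer at 4 bytes per
character instead of 6, and can treat any lead byte announcing a 5- or
6-byte sequence as an error.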

--Ken

>
> Dan
>
>


