Re: any unicode conversion tools?

From: Jon Hanna (jon@hackcraft.net)
Date: Fri May 07 2004 - 11:09:02 CDT


> >> it can be represented in UTF-8 format as:
> >> 1 byte: still 2F
> >> 2 bytes: C0 AF (illegal)
> >> 3 bytes: E0 80 AF (illegal)
> >
> > Thanks for keeping the indication that the last two are illegal with
> > UTF-8. But you should have better never listed them (even if there
> > still exists some legacy converters that will accept them, no one
> > should generate them). Note also that UTF-8 encoded sequences can be
> > up to 5 bytes long...
>
> How is that possible. I was under the impression that a UTF-8 sequence
> could never be more than 4 bytes (i.e. U+10FFFF becomes F4 8F BF BF).

UTF-8 as defined in Unicode 4.0 can never be more than 4 bytes long. However,
illegal sequences can be up to 6 (not just 5) bytes long.
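
As a quick check of the above, here is a minimal sketch using Python 3's
built-in strict UTF-8 codec. The helper name is_valid_utf8 is my own invention
for illustration; the only assumption is that a conforming decoder rejects
over-long and longer-than-four-octet sequences, which Python's does.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts `data`."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+10FFFF, the highest Unicode code point, needs exactly four octets.
highest = "\U0010FFFF".encode("utf-8")
print(highest.hex(" ").upper())

# The shortest form of U+002F is accepted; the over-long forms are not.
for hex_octets in ("2F", "C0 AF", "E0 80 AF", "FC 80 80 80 80 AF"):
    print(hex_octets, is_valid_utf8(bytes.fromhex(hex_octets)))
```

Only the one-octet form 2F should come back as valid; everything else is
rejected before it can be mistaken for a SOLIDUS.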

UTF-8 has been defined in various standards and specs as an encoding of either
Unicode or ISO 10646. ISO 10646 has space up to U+7FFFFFFF, although there is a
commitment not to use anything above U+10FFFF, to maintain compatibility with
Unicode.

Because of this, some of the published specifications for UTF-8 allow
U+7FFFFFFF and below to be encoded (U+7FFFFFFF would be encoded as FD BF BF BF
BF BF)[1]. For example, RFC 2279 (which is defined in terms of ISO 10646 alone)
allows this, but it is obsoleted by RFC 3629 (STD 63), which references the
Unicode standard.
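
To make the older one-to-six-octet scheme concrete, here is a rough sketch of
an encoder for it in Python. The function name encode_rfc2279 is hypothetical,
and this is only an illustration of the bit layout (it does not exclude
surrogates, for instance), not a reference implementation of any RFC.

```python
# (sequence length, exclusive upper limit, lead-octet marker) for 2..6 octets
_FORMS = (
    (2, 0x800, 0xC0),
    (3, 0x10000, 0xE0),
    (4, 0x200000, 0xF0),
    (5, 0x4000000, 0xF8),
    (6, 0x80000000, 0xFC),
)


def encode_rfc2279(cp: int) -> bytes:
    """Encode a value in 0..0x7FFFFFFF using the old 1-to-6-octet layout."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit ISO 10646 range")
    if cp < 0x80:
        return bytes([cp])
    # Pick the shortest form whose range contains the value.
    for length, limit, lead in _FORMS:
        if cp < limit:
            break
    # Peel off six payload bits per continuation octet, lowest bits first.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # Whatever bits remain go into the lead octet.
    return bytes([lead | cp] + tail[::-1])


print(encode_rfc2279(0x7FFFFFFF).hex(" ").upper())
print(encode_rfc2279(0x10FFFF).hex(" ").upper())
```

For values up to U+10FFFF (surrogates aside) this produces the same octets as
modern UTF-8; the five- and six-octet forms only appear above U+1FFFFF.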

A naïve processor that allowed both over-long sequences and code points up to
U+7FFFFFFF would treat the six-octet sequence FC 80 80 80 80 AF as an encoding
of U+002F SOLIDUS.

Indeed, depending on just how such a processor was getting things wrong (we can
only specify correct behaviour, after all; people are free to get things wrong
whatever way they want :) it's *just* about possible that the seven-octet
sequence FE 80 80 80 80 80 AF would also be treated as U+002F SOLIDUS.
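
Here is a sketch, in Python, of the sort of careless decoder I have in mind.
The name naive_decode_one is made up for illustration, and this is one guess
at how such a processor might be written, deliberately omitting the checks a
correct decoder must make (shortest form, range, legal lead octets).

```python
def naive_decode_one(seq: bytes) -> int:
    """Decode a single sequence with no over-long or range checks."""
    lead = seq[0]
    if lead < 0x80:
        return lead
    # The number of leading 1 bits in the first octet gives the length.
    length = 0
    while length < 8 and lead & (0x80 >> length):
        length += 1
    # The only sanity check: the octet count must match the lead octet.
    if length == 1 or len(seq) != length:
        raise ValueError("malformed sequence")
    # Payload bits of the lead octet are those after the first 0 bit.
    value = lead & (0xFF >> (length + 1))
    # Blindly take the low six bits of every following octet.
    for octet in seq[1:]:
        value = (value << 6) | (octet & 0x3F)
    return value


for hex_octets in (
    "2F",
    "C0 AF",
    "E0 80 AF",
    "FC 80 80 80 80 AF",
    "FE 80 80 80 80 80 AF",
):
    print(hex_octets, "->", hex(naive_decode_one(bytes.fromhex(hex_octets))))
```

Every one of those sequences comes out as 0x2F, which is exactly why a
security check for "/" done before such a decoder runs can be bypassed.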

[1] Indeed the format of UTF-8 would make it possible to unambiguously encode
any value up to 0xFFFFFFFFF (36 bits: the lead octet FE followed by six
continuation octets), but this exceeds the ISO 10646 codepoint space and it
would break one of UTF-8's design goals in requiring the use of the octet FE.

-- 
Jon Hanna
<http://www.hackcraft.net/>
"…it has been truly said that hackers have even more words for
equipment failures than Yiddish has for obnoxious people." - jargon.txt

