Re: any unicode conversion tools?

From: Jon Hanna (
Date: Fri May 07 2004 - 11:09:02 CDT

> >> it can be represented in UTF-8 format as:
> >> 1 byte: still 2F
> >> 2 bytes: C0 AF (illegal)
> >> 3 bytes: E0 80 AF (illegal)
> >
> > Thanks for keeping the indication that the last two are illegal in
> > UTF-8. But it would have been better never to list them (even if there
> > still exist some legacy converters that will accept them, no one should
> > generate them). Note also that UTF-8 encoded sequences can be up to
> > 5 bytes long...
> How is that possible? I was under the impression that a UTF-8 sequence
> could never be more than 4 bytes (i.e. U+10FFFF becomes F4 8F BF BF).

UTF-8 as defined in Unicode 4.0 can never be more than 4 bytes long. However,
illegal sequences can be up to 6 (not just 5) bytes long.
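
A quick way to check the legal side of this is sketched below. It assumes
Python 3 and its built-in strict "utf-8" codec (my choice of tool, not
something from this thread); the helper names hex_octets and
strictly_decodes are made up for illustration.

```python
# Sketch, assuming Python 3's built-in strict UTF-8 codec.

def hex_octets(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex octets."""
    return " ".join(f"{b:02X}" for b in data)

def strictly_decodes(data: bytes) -> bool:
    """Return True if the strict codec accepts the octet sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# U+002F SOLIDUS takes one octet; U+10FFFF takes the maximum of four.
print(hex_octets("/".encode("utf-8")))
print(hex_octets(chr(0x10FFFF).encode("utf-8")))

# The over-long forms quoted above are rejected by a strict decoder.
print(strictly_decodes(bytes.fromhex("C0AF")))
print(strictly_decodes(bytes.fromhex("E080AF")))
```

A conforming decoder should refuse both over-long forms rather than
silently map them to U+002F.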

UTF-8 has been defined in various standards and specs as an encoding of
either Unicode or ISO 10646. ISO 10646 has space up to U+7FFFFFFF,
although there is a commitment not to use anything above U+10FFFF, to maintain
compatibility with Unicode.

Because of this, some of the specifications for UTF-8 that have been published
allow U+7FFFFFFF and below to be encoded (U+7FFFFFFF would be encoded as FD
BF BF BF BF BF)[1]. For example, RFC 2279 (which is defined in terms of ISO
10646 alone) allows this, but it is obsoleted by RFC 3629 (STD 63), which
references the Unicode standard.
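
To make the older 1-to-6 octet layout concrete, here is a minimal sketch of
an encoder for it in Python. The function name encode_rfc2279 is
hypothetical, and the sketch is simplified: it only range-checks against
U+7FFFFFFF and does nothing about surrogates or other special values.

```python
# Sketch of the older six-octet-maximum scheme; not a conforming
# modern UTF-8 encoder. encode_rfc2279 is a hypothetical name.

def encode_rfc2279(cp: int) -> bytes:
    """Encode an integer in 0..0x7FFFFFFF using 1 to 6 octets."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit ISO 10646 range")
    if cp < 0x80:
        return bytes([cp])
    # (largest value, lead-octet marker) for the 2..6 octet forms
    forms = [
        (0x7FF, 0xC0),
        (0xFFFF, 0xE0),
        (0x1FFFFF, 0xF0),
        (0x3FFFFFF, 0xF8),
        (0x7FFFFFFF, 0xFC),
    ]
    for n_trail, (limit, marker) in enumerate(forms, start=1):
        if cp <= limit:
            break
    # Peel off six bits at a time for the trailing octets (10xxxxxx).
    trail = []
    for _ in range(n_trail):
        trail.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # Whatever is left fits in the payload bits of the lead octet.
    return bytes([marker | cp] + trail[::-1])

print(encode_rfc2279(0x7FFFFFFF).hex(" ").upper())  # FD BF BF BF BF BF
print(encode_rfc2279(0x10FFFF).hex(" ").upper())    # F4 8F BF BF
```

For values up to U+10FFFF the output is the same as that of a modern
encoder; the five- and six-octet forms only appear above U+1FFFFF.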

A naive processor that allowed both over-long sequences and code points
up to U+7FFFFFFF would treat the six-octet sequence FC 80 80 80 80 AF as an
encoding of U+002F SOLIDUS.

Indeed, depending on just how such a processor was getting things wrong (we
can only specify correct behaviour, after all; people are free to get things
wrong whatever way they want :) it's *just* about possible that the seven-octet
sequence FE 80 80 80 80 80 AF would also be treated as U+002F SOLIDUS.
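
The kind of mistake described here can be sketched as a deliberately naive
decoder in Python. This is an illustration of the failure mode only, not
any real converter's code; naive_decode_one is a hypothetical name, and it
takes the sequence length purely from the number of leading 1 bits in the
first octet, with no check for over-long forms.

```python
# Deliberately WRONG decoder, to illustrate the failure mode.
# It never checks that the shortest possible form was used.

def naive_decode_one(octets: bytes) -> int:
    """Decode one sequence, trusting the lead octet's length marker."""
    lead = octets[0]
    # Count the leading 1 bits of the lead octet.
    n_ones = 0
    while n_ones < 8 and lead & (0x80 >> n_ones):
        n_ones += 1
    if n_ones == 0:
        return lead  # plain ASCII octet
    if n_ones == 1 or n_ones == 8:
        raise ValueError("not a usable lead octet")
    if len(octets) != n_ones:
        raise ValueError("wrong number of octets for this lead octet")
    # Payload bits of the lead octet (none at all when the lead is FE).
    value = lead & (0x7F >> n_ones)
    for b in octets[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("expected a continuation octet")
        value = (value << 6) | (b & 0x3F)
    return value

for text in ("2F", "C0AF", "E080AF", "FC80808080AF", "FE8080808080AF"):
    print(text, "->", f"U+{naive_decode_one(bytes.fromhex(text)):04X}")
```

Every one of those inputs comes out as U+002F, which is why a filter that
only looks for the octet 2F can be bypassed if a decoder like this runs
after it.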

[1] Indeed, the format of UTF-8 would make it possible to unambiguously encode
values up to 0xFFFFFFFFF (36 bits: six trailing octets of six bits each), but
this exceeds the ISO 10646 code point space and would break one of UTF-8's
design goals by requiring the use of the octet FE.

Jon Hanna
"it has been truly said that hackers have even more words for
equipment failures than Yiddish has for obnoxious people." - jargon.txt
