Re: Comments on <draft-ietf-acap-mlsf-00.txt>?

From: Timothy Partridge (timpart@perdix.demon.co.uk)
Date: Wed Jun 04 1997 - 14:25:06 EDT


In message <9706041245.AA11103@unicode.org> you recently said:

> > Any comments on
> > ftp://ds.internic.net/internet-drafts/draft-ietf-acap-mlsf-00.txt
> > ?
>
> > Language tags are encoded by mapping them to upper-case, then
> > adding hexidecimal A0 to each octet. The result is broken up into
> > groups of five octets followed by a final group of five or fewer
> > octets. Each group is prefixed by a UTF-8-style length count with
> > the low bits set to 0.
>
> If I have not misunderstood UTF-8 or "MLSF" completely:
>
> A.
> 1. A UTF-8-style length count with the low bits set to 0 is
> **not** an "illegal" UTF-8 "start character code" octet.

I think they are unusual though, because the low order bits (except
the highest one) will have at least one bit set because of the
character being represented.
E.g. 00000yyyyyxxxxxx fits in two bytes. yyyyy is at least 00010, since
any value below hexadecimal 80 fits in one byte. The bytes are 110yyyyy
and 10xxxxxx. MLSF would use 11000000, which can never occur in UTF-8.
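
A quick way to check the two-byte case is to encode every code point
that needs exactly two octets and look at the first octet of each
result. This is only a sketch using Python's built-in UTF-8 codec; the
function name is made up for illustration.

```python
# Collect every lead octet produced by encoding the code points that
# need exactly two UTF-8 octets (U+0080 through U+07FF).
def two_byte_lead_octets():
    leads = set()
    for cp in range(0x80, 0x800):
        encoded = chr(cp).encode("utf-8")
        assert len(encoded) == 2
        leads.add(encoded[0])
    return leads


leads = two_byte_lead_octets()
print(hex(min(leads)), hex(max(leads)))  # 0xc2 0xdf
print(0xC0 in leads)                     # False
```

So 11000000 is never produced for a real character. Note that this
only covers the two-octet form; longer sequences are not examined here.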

> 2. Adding hexadecimal A0 to the "ASCII" codes for A-Z produces
> something that is an "illegal" UTF-8 continuation octet, but
> *is* a legal "start character code" octet (111xxxxx, where
> each x may be 1 or 0 independently of the others, with some
> exclusions).
>
> I think this would confuse most UTF-8 decoders, and is unlikely
> to be silently ignored.

He may well be assuming an implementation where a count byte triggers
a loop which reads a number of following bytes. As you say there are
other ways of implementing a decoder.
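
To make the two kinds of decoder concrete, here is a rough sketch. The
encoder follows my reading of the quoted description: a group of n
octets is prefixed by the zero-payload lead octet for a sequence of
n + 1 octets. That mapping, and all the function names, are assumptions
for illustration and not taken from the draft itself.

```python
# Lead octets with all payload bits zero, keyed by total sequence length
# (UTF-8 as originally defined allowed up to six octets). Assumed mapping.
LEAD_FOR_LENGTH = {2: 0xC0, 3: 0xE0, 4: 0xF0, 5: 0xF8, 6: 0xFC}


def mlsf_encode_tag(tag):
    """Hypothetical encoder: upper-case, add 0xA0, group by five."""
    shifted = bytes(b + 0xA0 for b in tag.upper().encode("ascii"))
    out = bytearray()
    for i in range(0, len(shifted), 5):
        group = shifted[i:i + 5]
        out.append(LEAD_FOR_LENGTH[len(group) + 1])
        out += group
    return bytes(out)


def sequence_length(lead):
    """Length implied by the leading one bits of an octet (at least 1)."""
    length, mask = 0, 0x80
    while lead & mask:
        length += 1
        mask >>= 1
    return max(length, 1)


def count_driven_ascii(data):
    """Keep ASCII octets; skip longer sequences by their count alone."""
    out = bytearray()
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        if n == 1:
            out.append(data[i])
        i += n  # the following octets are never inspected
    return bytes(out)


def strict_utf8_accepts(data):
    """True if Python's validating UTF-8 decoder accepts the octets."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


tagged = mlsf_encode_tag("en") + b"hello"
print(mlsf_encode_tag("en").hex())   # e0e5ee
print(count_driven_ascii(tagged))    # b'hello'
print(strict_utf8_accepts(tagged))   # False
```

The count-driven loop steps over the tag without noticing anything odd,
while a decoder that checks each continuation octet rejects the stream.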

Also, attempting to deduce the nature of a byte stream becomes more
complex (i.e. is it UTF-8, UCS-2, or some Japanese standard?).
This makes it harder to guess the format of an unknown file.

> B. This trick is designed for UTF-8 only, and does *not* work for
> Unicode/ISO/IEC10646 in general, which means it **cannot** be
> transformed into UTF-16 (nor UCS-4), without using some
> *other* way of representing the language tags.

I agree very strongly. UTF-7 would be used in mail (and news), so
the scheme is not usable.
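
The transformation problem can be seen by trying to convert a tagged
stream the usual way, i.e. decoding to code points and re-encoding. The
octets below are a hand-built example of a tag for "en" as I read the
quoted description (an assumption), followed by plain ASCII text.

```python
# Hypothetical MLSF-tagged stream: assumed tag octets for "en", then text.
mlsf_bytes = b"\xe0\xe5\xee" + b"hello"


def transcode(data, target):
    """Convert UTF-8 octets to another encoding via code points."""
    return data.decode("utf-8").encode(target)


for target in ("utf-16-be", "utf-7"):
    try:
        transcode(mlsf_bytes, target)
        print(target, "converted")
    except UnicodeDecodeError:
        # The tag octets do not correspond to any code point.
        print(target, "cannot be produced from this stream")

# Forcing the conversion replaces the tag octets, so the tag is lost.
lossy = mlsf_bytes.decode("utf-8", errors="replace")
print("hello" in lossy, "\ufffd" in lossy)  # True True
```

Since the tag has no code point behind it, nothing is left to carry
over into the other encoding.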

   Tim


