>> > Any comments on
>> > ftp://ds.internic.net/internet-drafts/draft-ietf-acap-mlsf-00.txt
>> > ?
>> > Language tags are encoded by mapping them to upper-case, then
>> > adding hexadecimal A0 to each octet. The result is broken up into
>> > groups of five octets followed by a final group of five or fewer
>> > octets. Each group is prefixed by a UTF-8-style length count with
>> > the low bits set to 0.
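
To make the quoted encoding rule concrete, here is a minimal sketch in Python. The function name and the LENGTH_PREFIX table are illustrative assumptions, not taken from the draft: the table assumes that a "UTF-8-style length count with the low bits set to 0" means the lead octet of an (n+1)-octet sequence with all payload bits cleared, where n is the number of tag octets in the group.

```python
# Assumed prefix table: lead octet of a 2- to 6-octet sequence (original
# UTF-8 scheme) with every payload bit zero, keyed by the number of tag
# octets that follow. This mapping is an assumption, not quoted from the draft.
LENGTH_PREFIX = {1: 0xC0, 2: 0xE0, 3: 0xF0, 4: 0xF8, 5: 0xFC}


def mlsf_encode_tag(tag: str) -> bytes:
    """Sketch of the tag encoding: upper-case, add 0xA0, group by five."""
    # Assumes the tag holds only ASCII letters, digits and '-', so that
    # adding 0xA0 never exceeds 0xFF after upper-casing.
    shifted = bytes(b + 0xA0 for b in tag.upper().encode("ascii"))
    out = bytearray()
    for i in range(0, len(shifted), 5):
        group = shifted[i:i + 5]
        out.append(LENGTH_PREFIX[len(group)])  # length prefix for this group
        out.extend(group)
    return bytes(out)


encoded = mlsf_encode_tag("en-US")
```

Under these assumptions a five-letter tag fits in a single group, and a longer tag spills into a second, shorter group with its own prefix.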
>> If I have not misunderstood UTF-8 or "MLSF" completely:
>> 1. A UTF-8-style length count with the low bits set to 0 is
>> **not** an "illegal" UTF-8 "start character code" octet.
>I think they are unusual though because the low order bits (except
>the highest one) will have at least one bit set because of the
>character being represented.
>e.g. 00000yyyyyxxxxxx fits in two bytes. yyyyy is non-zero since
>otherwise one byte could be used. The bytes are 110yyyyy and 10xxxxxx.
>MLSF would use 11000000 which can never occur in UTF-8.
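
The two-byte argument above can be checked with a short Python sketch. The helper name is hypothetical; it simply applies the 110yyyyy 10xxxxxx layout to every code point that needs two bytes and collects the lead octets that result.

```python
def encode_two_octets(cp: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as 110yyyyy 10xxxxxx."""
    if not 0x80 <= cp <= 0x7FF:
        raise ValueError("code point does not need exactly two octets")
    return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])


# Every lead octet a shortest-form encoder can produce for two-byte sequences.
lead_octets = {encode_two_octets(cp)[0] for cp in range(0x80, 0x800)}

# 11000000 (0xC0) is absent: the yyyyy bits are never all zero.
c0_is_used = 0xC0 in lead_octets
```

The set runs from 0xC2 to 0xDF, so 0xC0 is never emitted by an encoder that always picks the shortest form.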
I guess most UTF-8 converters will encode optimally and only use the
extra bytes where needed. However, most table-driven decoders will
not check if those bits are actually used and just start bit-shifting.
Changing that behaviour is easy: just change the relevant positions
in the table to "don't care" values.
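
As an illustration of the difference, here is a sketch in Python of a decoder that only bit-shifts, next to a check against a strict decoder. Both function names are hypothetical, and the naive decoder handles two-octet sequences only.

```python
def naive_decode_two_octets(data: bytes) -> int:
    """Decode 110yyyyy 10xxxxxx by bit-shifting alone, with no range check."""
    lead, cont = data  # expects exactly two octets
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def rejected_by_strict_decoder(data: bytes) -> bool:
    """Return True if Python's built-in UTF-8 codec refuses the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


overlong = bytes([0xC0, 0x80])
naive_result = naive_decode_two_octets(overlong)      # silently yields 0
strict_rejects = rejected_by_strict_decoder(overlong)  # strict codec refuses
```

A decoder of the first kind accepts an octet such as 0xC0 without complaint, while one that validates shortest-form encoding treats it as an error.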
>> 2. Adding hexadecimal A0 to the "ASCII" codes for A-Z produces
>> something that is an "illegal" UTF-8 continuation octet, but
>> *is* a legal "start character code" octet (111xxxxx, where
>> each x may be 1 or 0 independently of the others, with some
>> I think this would confuse most UTF-8 decoders, and is unlikely
>> to be silently ignored.
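
The arithmetic behind point 2 can be checked with a small Python sketch. The classifier is a simplification that looks only at the top two bits of an octet, and its name is an assumption made for this example.

```python
def classify_octet(b: int) -> str:
    """Classify an octet by its top bits, as a UTF-8 decoder would see it."""
    if b < 0x80:
        return "ascii"         # 0xxxxxxx
    if b < 0xC0:
        return "continuation"  # 10xxxxxx
    return "lead"              # 11xxxxxx (simplified: 0xFE/0xFF not singled out)


# 'A'..'Z' shifted by 0xA0 gives 0xE1..0xFA.
shifted_letters = [ord(c) + 0xA0 for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"]
kinds = {classify_octet(b) for b in shifted_letters}
```

Every shifted letter has its top three bits set, so a decoder sees each one as the start of a multi-octet sequence rather than as a continuation octet.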
>He may well be assuming an implementation where a count byte triggers
>a loop which reads a number of following bytes. As you say there are
>other ways of implementing a decoder.
>Also attempting to deduce the nature of a byte stream becomes more
>complex. (I.e. is it UTF-8, UCS-2, some Japanese standard.)
>This makes it harder to have a guess at a file of unknown format.
This problem would almost be solved if you add 0x50 instead of 0xA0.
The character '-' still causes problems, but that might be mapped to
'^'+0x50 or '_'+0x50. This results in values of the form 10xxxxxx,
which are always correct continuation octets, so old UTF-8 decoders
handle it correctly (and probably add bogus symbols to your document).
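
A quick Python check of the 0x50 proposal, as a sketch only (the variable and function names are illustrative). It shifts 'A' to 'Z', the hyphen, and the two suggested substitutes, then tests which results match the continuation pattern 10xxxxxx.

```python
def is_continuation(b: int) -> bool:
    """True if the octet matches the UTF-8 continuation pattern 10xxxxxx."""
    return (b & 0xC0) == 0x80


# 'A'..'Z' shifted by 0x50 gives 0x91..0xAA.
shifted_letters = [ord(c) + 0x50 for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"]
letters_ok = all(is_continuation(b) for b in shifted_letters)

# '-' lands on 0x7D, which is plain ASCII '}' and not a continuation octet.
shifted_hyphen = ord("-") + 0x50

# The suggested substitutes '^' and '_' land on 0xAE and 0xAF.
shifted_substitutes = [ord("^") + 0x50, ord("_") + 0x50]
```

The letters land on 0x91 through 0xAA, all continuation octets, while the hyphen becomes an ordinary ASCII character, which is why a substitute is needed for it.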
>> B. This trick is designed for UTF-8 only, and does *not* work for
>> Unicode/ISO/IEC10646 in general, which means it **cannot** be
>> transformed into UTF-16 (nor UCS-4), without using some
>> *other* way of representing the language tags.
>I agree very strongly. UTF-7 would be used in mail (and news), so
>the scheme is not usable.
For that purpose, you could use elements of the private zone to
encode the language tags, which would also work for UTF-8 and the
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:34 EDT