> Any comments on
> Language tags are encoded by mapping them to upper-case, then
> adding hexadecimal A0 to each octet. The result is broken up into
> groups of five octets followed by a final group of five or fewer
> octets. Each group is prefixed by a UTF-8-style length count with
> the low bits set to 0.
If I have not misunderstood UTF-8 or "MLSF" completely:
1. A UTF-8-style length count with the low bits set to 0 is
**not** an "illegal" UTF-8 "start character code" octet.
2. Adding hexadecimal A0 to the "ASCII" codes for A-Z produces
something that is an "illegal" UTF-8 continuation octet, but
*is* a legal "start character code" octet (111xxxxx, where
each x may be 1 or 0 independently of the others, with some
I think this would confuse most UTF-8 decoders, and is unlikely
to be silently ignored.
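Points 1 and 2 can be checked mechanically. The following is only a
rough sketch: the helper name utf8_octet_class is made up for this
illustration, and it classifies octets purely by their high bits, assuming
the 1-to-6-octet UTF-8 layout (it does not check for overlong forms or
other finer rules).

```python
def utf8_octet_class(octet: int) -> str:
    """Classify a single octet by its high bits (1-to-6-octet UTF-8 layout)."""
    if octet < 0x80:
        return "ascii"          # 0xxxxxxx
    if octet < 0xC0:
        return "continuation"   # 10xxxxxx
    if octet < 0xFE:
        return "start"          # 110xxxxx through 1111110x
    return "unused"             # FE and FF never occur in UTF-8

# Point 1: UTF-8-style length counts with the low bits set to 0.
length_counts = [0xC0, 0xE0, 0xF0, 0xF8, 0xFC]

# Point 2: "A".."Z" with hexadecimal A0 added to each octet.
shifted_letters = [ord(c) + 0xA0 for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"]

for name, octets in [("length counts", length_counts),
                     ("shifted letters", shifted_letters)]:
    classes = sorted({utf8_octet_class(b) for b in octets})
    print(f"{name}: {min(octets):02X}..{max(octets):02X} -> {classes}")
```

Under these assumptions both sets of octets come out as "start" octets and
none as continuation octets, so a decoder sees a run of start octets with
no continuation octets following them.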
B. This trick is designed for UTF-8 only and does *not* work for
Unicode/ISO/IEC 10646 in general, which means MLSF-tagged text
**cannot** be transformed into UTF-16 (or UCS-4) without using
some *other* way of representing the language tags.
C. "Higher level protocols" (e.g. MS-doc/RTF, HTML, etc., etc.)
seem to be a more suitable place for handling language tags
(and that is where they are handled now).
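To illustrate point B, here is a small sketch of what happens when
MLSF-tagged octets are handed to an ordinary UTF-8 to UTF-16 conversion.
It rests on several assumptions: the function names mlsf_tag and
utf8_to_utf16 are invented for this example, the length count is assumed
to cover the prefix octet plus the octets of its group (my reading of the
quoted description, not confirmed by it), and Python's strict UTF-8
decoder stands in for "a UTF-8 decoder" in general.

```python
from typing import Optional

def mlsf_tag(tag: str) -> bytes:
    """Encode a language tag as described in the quoted text (assumed reading)."""
    shifted = bytes(ord(c) + 0xA0 for c in tag.upper())
    # Assumption: the count is the total sequence length, prefix included,
    # written as a UTF-8 start octet with its low bits set to 0.
    prefix_for_total = {2: 0xC0, 3: 0xE0, 4: 0xF0, 5: 0xF8, 6: 0xFC}
    out = bytearray()
    for i in range(0, len(shifted), 5):      # groups of five or fewer octets
        group = shifted[i:i + 5]
        out.append(prefix_for_total[len(group) + 1])
        out += group
    return bytes(out)

def utf8_to_utf16(data: bytes) -> Optional[bytes]:
    """Transcode UTF-8 to UTF-16 (big-endian); None if the input is rejected."""
    try:
        text = data.decode("utf-8")          # strict decoding by default
    except UnicodeDecodeError:
        return None
    return text.encode("utf-16-be")

tagged = mlsf_tag("en") + b"hello"
print(tagged.hex())                  # tag octets followed by plain ASCII
print(utf8_to_utf16(b"hello"))       # untagged text converts normally
print(utf8_to_utf16(tagged))         # tagged text is rejected
```

Since the tag octets never decode to any code points, the converter has
nothing to carry over into UTF-16; the tag would have to be recognized
and re-expressed by some separate mechanism.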
IMHO, MLSF should thus **not** be used.
Any opinions expressed are my personal ones, etc., etc., ...