The draft quoted below gets UTF-8 wrong. If you try to use "unoccupied"
bits, receivers will get the wrong result: they will either throw an
exception or interpret the octets as a character, just the wrong one.
See pages A-1 through A-11 of the Unicode Standard.
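To make that concrete, here is a minimal Python sketch. It covers only the "add hexadecimal A0" step described in the draft text quoted below, not the length-count prefix, and the helper name `add_a0` is my own, not something from the draft. It shows both outcomes: a strict decoder throws, and a lenient one produces characters that are not the original tag.

```python
def add_a0(tag: str) -> bytes:
    """Upper-case an ASCII language tag and add 0xA0 to each octet.

    Hypothetical helper: models only the shifting step, not the grouping
    or the length-count prefix.
    """
    return bytes(b + 0xA0 for b in tag.upper().encode("ascii"))


shifted = add_a0("en")  # 'E' -> 0xE5, 'N' -> 0xEE

# A strict UTF-8 decoder refuses the octets outright.
try:
    shifted.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "exception"

# A lenient decoder substitutes replacement characters (U+FFFD);
# either way the receiver does not see "EN".
lenient = shifted.decode("utf-8", errors="replace")

print(shifted.hex(), strict_result, repr(lenient))
```

Other decoders may behave differently in detail, but none of them can treat these octets as silently ignorable.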
Unicode Discussion wrote:
> > Any comments on
> > ftp://ds.internic.net/internet-drafts/draft-ietf-acap-mlsf-00.txt
> > ?
> > Language tags are encoded by mapping them to upper-case, then
> > adding hexadecimal A0 to each octet. The result is broken up into
> > groups of five octets followed by a final group of five or fewer
> > octets. Each group is prefixed by a UTF-8-style length count with
> > the low bits set to 0.
> If I have not misunderstood UTF-8 or "MLSF" completely:
> 1. A UTF-8-style length count with the low bits set to 0 is
> **not** an "illegal" UTF-8 "start character code" octet.
> 2. Adding hexadecimal A0 to the "ASCII" codes for A-Z produces
> something that is an "illegal" UTF-8 continuation octet, but
> *is* a legal "start character code" octet (111xxxxx, where
> each x may be 1 or 0 independently of the others, with some
> I think this would confuse most UTF-8 decoders, and is unlikely
> to be silently ignored.
> B. This trick is designed for UTF-8 only, and does *not* work for
> Unicode/ISO/IEC 10646 in general, which means it **cannot** be
> transformed into UTF-16 (nor UCS-4), without using some
> *other* way of representing the language tags.
> C. "Higher level protocols" (e.g. MS-doc/RTF, HTML, etc., etc.)
> seem to be a more suitable place for handling language tags
> (and is where they are handled now).
> IMHO, MLSF should thus **not** be used.
> /kent karlsson
> Any opinions expressed are my personal ones, etc., etc., ...
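Point 2 of the quoted message can be checked mechanically. The following is a small Python sketch, assuming the usual UTF-8 bit patterns (10xxxxxx for continuation octets, 11xxxxxx for start octets); the `classify` helper is a name I made up for illustration.

```python
import string


def classify(octet: int) -> str:
    """Classify one octet by its high bits, as a UTF-8 decoder would."""
    if octet < 0x80:
        return "ascii"
    if (octet & 0xC0) == 0x80:
        return "continuation"
    return "lead"


# Add 0xA0 to the ASCII code of each upper-case letter A-Z.
shifted = {ch: ord(ch) + 0xA0 for ch in string.ascii_uppercase}

# Collect the distinct classifications that occur.
kinds = {classify(b) for b in shifted.values()}

# Check the 111xxxxx pattern mentioned in the quoted message.
all_111 = all((b & 0xE0) == 0xE0 for b in shifted.values())

print(f"A -> {shifted['A']:#04x}, Z -> {shifted['Z']:#04x}", kinds, all_111)
```

Every shifted letter falls in the range E1 to FA, so a decoder sees each one as the start of a multi-octet sequence and never as a continuation octet.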