John Cowan wrote on 1999-11-01 16:42 UTC:
> Markus Kuhn wrote:
> > c) Let's extend UTF-16 to provide an encoding of malformed UTF-8 sequences.
> > For instance, we could define in Plane 14 255 bytes that represent
> > bytes which were part of an illegal UTF-8 sequence.
> Better yet, let's use them to represent arbitrary octets! Then we can
> have characters OCTET 00 through OCTET FF, and any binary stuff can be
> embedded in Unicode (at a fourfold increase in size for either UTF-8 or
I prefer higher-layer protocols to embed binary stuff with no space
overhead. Why would I want to blow up a JPEG photo by a factor of 4 to
embed it using your technique?
Sorry for my silly mistake: We'd need of course only 128 code points to
represent the bytes of malformed UTF-8 sequences, because bytes 0x00 to
0x7f are always correct UTF-8 sequences.
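As a quick sanity check of that count, here is a small Python sketch. The helper name is_valid_alone is made up for illustration; it relies only on Python's strict built-in UTF-8 decoder.

```python
def is_valid_alone(b: int) -> bool:
    """True if the single byte b is a complete, valid UTF-8 sequence by itself."""
    try:
        bytes([b]).decode("utf-8")  # strict decoding raises on malformed input
        return True
    except UnicodeDecodeError:
        return False

# Bytes that can never be valid on their own and therefore need an escape code.
needs_escape = [b for b in range(256) if not is_valid_alone(b)]

print(len(needs_escape))                          # 128
print(hex(needs_escape[0]), hex(needs_escape[-1]))  # 0x80 0xff
```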
I don't think there is enough space left in the BMP to get 128 extra
codes for such a special purpose. But we don't need these anyway.
I have a *much* better idea:
Let's just represent malformed UTF-8 sequences by malformed UTF-16
sequences (unpaired low surrogates). Add 0xDC00 to every byte of a
malformed UTF-8 sequence to get a proper 16-bit representation in
UTF-16. A UTF-16 low-half zone surrogate in the range U+DC80 to U+DCFF
that is not immediately preceded by a high-half surrogate shall
represent a byte of a malformed UTF-8 sequence.
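To make the mapping concrete, here is a minimal Python sketch of both directions. The function names utf8_to_utf16_units and utf16_units_to_utf8 are hypothetical, and the sketch assumes that "malformed" means whatever Python's strict UTF-8 decoder rejects. Unpaired surrogates outside U+DC80 to U+DCFF are simply rejected on the way back; how to treat them is a separate question that this sketch does not try to answer.

```python
def utf8_to_utf16_units(data: bytes) -> list:
    """Decode bytes to 16-bit code units; malformed bytes become 0xDC00 + byte."""
    units = []
    i = 0
    while i < len(data):
        for n in (1, 2, 3, 4):
            try:
                ch = data[i:i + n].decode("utf-8")  # strict decoding
            except UnicodeDecodeError:
                continue
            cp = ord(ch)
            if cp < 0x10000:
                units.append(cp)
            else:
                # Encode a supplementary character as a surrogate pair.
                cp -= 0x10000
                units.append(0xD800 + (cp >> 10))
                units.append(0xDC00 + (cp & 0x3FF))
            i += n
            break
        else:
            # No valid sequence starts here: escape this one byte.
            units.append(0xDC00 + data[i])
            i += 1
    return units


def utf16_units_to_utf8(units) -> bytes:
    """Inverse mapping: unpaired U+DC80..U+DCFF turn back into raw bytes."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            # Proper surrogate pair: one supplementary character.
            cp = 0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00)
            out += chr(cp).encode("utf-8")
            i += 2
        elif 0xDC80 <= u <= 0xDCFF:
            # Unpaired low surrogate in the escape range: restore the byte.
            out.append(u - 0xDC00)
            i += 1
        elif 0xD800 <= u <= 0xDFFF:
            raise ValueError("unpaired surrogate outside the escape range")
        else:
            out += chr(u).encode("utf-8")
            i += 1
    return bytes(out)


data = b"caf\xc3\xa9 \xff\x80 \xe2\x82\xac"
units = utf8_to_utf16_units(data)
print([hex(u) for u in units])
print(utf16_units_to_utf8(units) == data)  # True
```

In this example the two stray bytes 0xFF and 0x80 come out as the code units 0xDCFF and 0xDC80, and converting back restores the original bytes exactly.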
How do you like that?
Question 1: (quite easy) Since we now have a guaranteed lossless UTF-8 ->
UTF-16 -> UTF-8 roundtrip, can you think of a compatible lossless
UTF-16 -> UTF-8 -> UTF-16 roundtrip that will also preserve all illegal
sequences byte by byte?
Question 2: (more interesting) To what extent can you design these
encodings as homomorphisms with regard to splitting and concatenation of
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>