Re: UTF-16 encoding of malformed UTF-8 sequences

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Tue Nov 02 1999 - 04:55:01 EST

John Cowan wrote on 1999-11-01 16:42 UTC:
> Markus Kuhn wrote:
>
> > c) Let's extend UTF-16 to provide an encoding of malformed UTF-8 sequences.
> > For instance, we could define in Plane 14 255 bytes that represent
> > bytes which were part of an illegal UTF-8 sequence.
>
> Better yet, let's use them to represent arbitrary octets! Then we can
> have characters OCTET 00 through OCTET FF, and any binary stuff can be
> embedded in Unicode (at a fourfold increase in size for either UTF-8 or
> UTF-16).

I prefer higher-layer protocols to embed binary stuff with no space
overhead. Why would I want to blow up a JPEG photo by a factor of 4 to
embed it using your technique?

Sorry for my silly mistake: We'd need of course only 128 code points to
represent the bytes of malformed UTF-8 sequences, because bytes 0x00 to
0x7f are always correct UTF-8 sequences.

I don't think there is enough space left in the BMP to get 128 extra
codes for such a special purpose. But we don't need these anyway.

I have a *much* better idea:

Let's just represent malformed UTF-8 sequences by malformed UTF-16
sequences (unpaired low surrogates). Add 0xDC00 to every byte of a
malformed UTF-8 sequence to get a proper 16-bit representation in
UTF-16. A UTF-16 low-half zone surrogate in the range U+DC80 to U+DCFF
that is not immediately preceded by a high-half surrogate shall
represent a byte of a malformed UTF-8 sequence.
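
To make the mapping concrete, here is a minimal sketch in Python. The
function names utf8_to_utf16_units and utf16_units_to_utf8 are made up
for this illustration, and the sketch leans on Python's strict UTF-8
codec to decide what counts as malformed. How to treat other unpaired
surrogates on the way back is left open here, so the sketch simply
rejects them.

```python
def utf8_to_utf16_units(data: bytes) -> list:
    """Decode bytes as UTF-8 into a list of UTF-16 code units.

    Every byte that is not part of a well-formed UTF-8 sequence is
    escaped as the unpaired low surrogate 0xDC00 + byte.
    """
    units = []
    i = 0
    while i < len(data):
        # Try the 1- to 4-byte prefix starting at i with the strict codec.
        for n in (1, 2, 3, 4):
            try:
                ch = data[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                continue
            be = ch.encode("utf-16-be")
            units.extend(int.from_bytes(be[k:k + 2], "big")
                         for k in range(0, len(be), 2))
            i += n
            break
        else:
            # No well-formed sequence starts here: escape this one byte.
            # Such a byte is always in 0x80..0xFF, so the result lies in
            # U+DC80..U+DCFF.
            units.append(0xDC00 + data[i])
            i += 1
    return units


def utf16_units_to_utf8(units) -> bytes:
    """Inverse of utf8_to_utf16_units for the sequences it produces."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            # Proper surrogate pair: an ordinary supplementary character.
            cp = 0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00)
            out += chr(cp).encode("utf-8")
            i += 2
        elif 0xDC80 <= u <= 0xDCFF:
            # Unpaired low surrogate in the escape range: restore the byte.
            out.append(u - 0xDC00)
            i += 1
        elif 0xD800 <= u <= 0xDFFF:
            # Assumption of this sketch: other unpaired surrogates are errors.
            raise ValueError("unpaired surrogate 0x%04X" % u)
        else:
            out += chr(u).encode("utf-8")
            i += 1
    return bytes(out)


raw = b"a\xffb"  # 0xFF can never appear in well-formed UTF-8
print([hex(u) for u in utf8_to_utf16_units(raw)])  # ['0x61', '0xdcff', '0x62']
print(utf16_units_to_utf8(utf8_to_utf16_units(raw)) == raw)  # True
```

An escaped byte is never emitted directly after a high-half surrogate,
because high halves only come out of well-formed four-byte sequences,
which always supply their own low half.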

How do you like that?

Homework:

Question 1: (quite easy) Since we now have a guaranteed lossless UTF-8 ->
UTF-16 -> UTF-8 roundtrip, can you think of a compatible lossless
UTF-16 -> UTF-8 -> UTF-16 roundtrip that will also preserve all illegal
sequences byte by byte?

Question 2: (more interesting) To what extent can you design these
encodings as homomorphisms with regard to splitting and concatenation of
sequences?

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>
