Encoding of non-characters

From: Doug Ewell (dewell@compuserve.com)
Date: Sat Jul 29 2000 - 12:02:03 EDT


Did I read recently (in a message that I shortsightedly deleted)
something to the effect that a character encoding scheme (CES) or
transfer encoding syntax (TES) needs to be able to encode the non-
characters U+D800 through U+DFFF, and presumably U+xxFFFE and U+xxFFFF
as well?

I've been playing around with a TES (or maybe it's a CES; I'm still
having a little trouble knowing exactly where to draw the line). Don't
worry, I'm not going to propose it anywhere as Yet Another UTF. I'm
just playing around with Unicode, and hopefully teaching myself
something along the way.

Anyway, my scheme encodes non-BMP characters not *as* surrogates, but
using the surrogate mechanism in a slightly modified way. Like UTF-16,
this makes it impossible to encode the BMP non-characters in the range
U+D800 through U+DFFF. Normally I wouldn't think this was a problem,
but I thought someone (Davis?) just said recently that it should be
possible to round-trip these thingies, for some reason.
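
For illustration, here is a minimal sketch of the standard UTF-16 surrogate mechanism referred to above (not the modified scheme, whose details are not given here). The function name is hypothetical. It shows why the range U+D800 through U+DFFF has no encoded form of its own: those 16-bit values are reserved as the two halves of a pair.

```python
def to_utf16_units(cp):
    """Return the list of 16-bit UTF-16 code units for one code point.

    Hypothetical helper for illustration only; it follows plain UTF-16,
    not the modified scheme described in the message.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range: %#x" % cp)
    if 0xD800 <= cp <= 0xDFFF:
        # These values are used as surrogate code units, so a lone
        # code point in this range cannot be represented.
        raise ValueError("U+%04X is a surrogate code point" % cp)
    if cp < 0x10000:
        return [cp]                      # BMP: one code unit, unchanged
    v = cp - 0x10000                     # 20-bit offset above the BMP
    high = 0xD800 + (v >> 10)            # top 10 bits
    low = 0xDC00 + (v & 0x3FF)           # bottom 10 bits
    return [high, low]
```

Under these assumptions, U+10000 becomes the pair D800 DC00, and any attempt to encode U+D800 itself raises an error.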

The situation would be different in the case of U+xxFFFE and U+xxFFFF,
because while the surrogates occupy entire ranges that can be utilized
in a special way, you kind of have to *deliberately* exclude the
U+xxFFFE and U+xxFFFF code points. Nonetheless, the same question
applies: Must these bogus
code points be representable in a CES or TES, or can they be handled
conformantly by raising an error or mapping them to U+FFFD?
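
The two handling options in question could be sketched roughly as follows. This is only an illustration of the alternatives, not a statement about which of them (if either) is conformant. The function names and the `policy` parameter are hypothetical, and the set of excluded code points is limited to the ones named in this message.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def is_excluded(cp):
    """True for U+D800..U+DFFF and for U+xxFFFE / U+xxFFFF in any plane."""
    return 0xD800 <= cp <= 0xDFFF or (cp & 0xFFFE) == 0xFFFE


def filter_code_points(cps, policy="replace"):
    """Apply one of the two candidate behaviours before encoding.

    policy="error"   raises ValueError on the first excluded code point
    policy="replace" maps each excluded code point to U+FFFD
    (Both the name and the policy values are assumptions.)
    """
    out = []
    for cp in cps:
        if is_excluded(cp):
            if policy == "error":
                raise ValueError("U+%04X cannot be encoded" % cp)
            cp = REPLACEMENT
        out.append(cp)
    return out
```

With the "replace" policy, the original code point cannot be recovered after decoding, so round-tripping is lost; with the "error" policy, nothing is encoded at all.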

-Doug Ewell
 Fullerton, California


