From: Doug Ewell (email@example.com)
Date: Fri Nov 21 2003 - 11:11:15 EST
Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
> Could an editor loading such an incorrect but legacy GB-18030 file accept
> to load it and work with it using an internal-only UCS-4 mapping (or
> an extended UTF-8 mapping), to preserve those out-of-range sequences,
> as if they were mapped in an extra PUA range?
> Of course saving the file into a UTF encoding would be forbidden, but
> saving the internal UCS-4 file back to GB-18030 would preserve those
> out-of-range GB-18030 sequences, without making any other
> interpretation, and without changing them arbitrarily into the GB18030
> equivalent of U+FFFD?
We talked about this not long ago concerning invalid UTF-8 sequences,
and the same arguments would apply here. Most people agreed that:
(1) There is no particular reason to preserve invalid code unit
sequences, as if they had some kind of paleographic value.
(2) It is not the responsibility of encoding scheme A to provide a
mapping for an invalid sequence in encoding scheme B.
Unless GB 18030 prohibits invalid sequences the way Unicode does, I
suppose there's no reason you couldn't map invalid GB 18030 sequences to
PUA code points *within the privacy of your own application* if you
really want to preserve them in some way, and have some idea what you
want to do with them. You MAY NOT map them to Unicode noncharacters or
anything outside the Unicode/10646 range (i.e. beyond U+10FFFF).
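
For illustration only, here is a minimal Python sketch of that kind of
application-private mapping. It is not a recommendation and nothing in
GB 18030 or Unicode defines it. The handler name "pua-escape", the
helper names, and the choice of U+F0000..U+F00FF (the start of
Supplementary Private Use Area-A) as the escape range are all
assumptions made up for this example. It also assumes the file does not
already contain valid GB 18030 sequences that decode into that same
range, since those would be indistinguishable from escaped bytes.

```python
import codecs

# Hypothetical, application-private choice: one PUA code point per raw byte,
# taken from the start of Supplementary Private Use Area-A.
PUA_BASE = 0xF0000


def is_private_use(cp: int) -> bool:
    """True only for the three Unicode private-use ranges (never noncharacters)."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


# Sanity check: every escape code point really is a PUA code point.
assert is_private_use(PUA_BASE) and is_private_use(PUA_BASE + 0xFF)


def _escape_invalid_bytes(exc):
    """Decode error handler: turn each undecodable byte into PUA_BASE + byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua-escape", _escape_invalid_bytes)


def load_gb18030(data: bytes) -> str:
    """Decode GB 18030, keeping invalid bytes as private-use placeholders."""
    return data.decode("gb18030", errors="pua-escape")


def save_gb18030(text: str) -> bytes:
    """Encode back to GB 18030, restoring the placeholder bytes verbatim."""
    out = bytearray()
    for ch in text:
        offset = ord(ch) - PUA_BASE
        if 0 <= offset <= 0xFF:
            out.append(offset)            # escaped raw byte goes back unchanged
        else:
            out += ch.encode("gb18030")   # ordinary character
    return bytes(out)
```

Under those assumptions, save_gb18030(load_gb18030(data)) gives back the
original bytes. The placeholders mean nothing to any other program, so
the decoded string should stay inside the application and should not be
written out in a UTF or handed to anyone else.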