RE: creating a test font w/ CJKV Extension B characters.

From: Philippe Verdy (
Date: Fri Nov 21 2003 - 09:12:26 EST


    From: Andrew C. West
    > (Unfortunately I've just noticed that BabelPad has a slight
    > bug with out of range GB-18030 values such as
    > <E3 32 9A 36> = U+110000.)

    Could an editor accept such an incorrect but legacy GB-18030 file and
    work with it using an internal-only UCS-4 mapping (or an extended UTF-8
    mapping) that preserves those out-of-range sequences, as if they were
    mapped into an extra PUA range?

    Of course, saving the file in a UTF encoding would be forbidden, but saving
    the internal UCS-4 buffer back to GB-18030 would preserve those out-of-range
    GB-18030 sequences, without interpreting them any further and without
    changing them arbitrarily into the GB18030 equivalent of U+FFFD.
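
As a rough illustration of such an internal-only mapping, here is a minimal Python sketch of the linear arithmetic behind GB18030 four-byte sequences, continued past U+10FFFF. The function names are hypothetical, the sketch covers only the supplementary range that starts at <90 30 81 30> = U+10000, and treating values above U+10FFFF as internal code values is the assumption under discussion, not something GB18030 or Unicode defines.

```python
# Sketch only: four-byte GB18030 arithmetic, extended beyond U+10FFFF.
# All names are hypothetical; values above 0x10FFFF are internal-only.

SUPPLEMENTARY_BASE = 189000          # linear index of <90 30 81 30> (U+10000)
FOUR_BYTE_COUNT = 126 * 10 * 126 * 10  # number of four-byte code positions


def four_byte_to_linear(seq: bytes) -> int:
    """Return the linear index of a four-byte GB18030 sequence."""
    b1, b2, b3, b4 = seq  # raises ValueError unless exactly four bytes
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a well-formed four-byte sequence")
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def linear_to_four_byte(index: int) -> bytes:
    """Inverse of four_byte_to_linear."""
    if not 0 <= index < FOUR_BYTE_COUNT:
        raise ValueError("linear index out of four-byte range")
    index, d4 = divmod(index, 10)
    index, d3 = divmod(index, 126)
    d1, d2 = divmod(index, 10)
    return bytes([0x81 + d1, 0x30 + d2, 0x81 + d3, 0x30 + d4])


def to_internal(seq: bytes) -> int:
    """Map a supplementary-range sequence to an internal code value.

    The result may exceed 0x10FFFF; such values must never be written
    out in a UTF, only converted back with from_internal().
    """
    linear = four_byte_to_linear(seq)
    if linear < SUPPLEMENTARY_BASE:
        raise ValueError("BMP range needs the table-driven mapping")
    return 0x10000 + (linear - SUPPLEMENTARY_BASE)


def from_internal(value: int) -> bytes:
    """Map an internal code value (>= 0x10000) back to its four bytes."""
    if value < 0x10000:
        raise ValueError("BMP range needs the table-driven mapping")
    return linear_to_four_byte(value - 0x10000 + SUPPLEMENTARY_BASE)


print(hex(to_internal(bytes.fromhex("E3329A36"))))  # the sequence quoted above
print(from_internal(0x110000).hex().upper())
```

Because the mapping is purely arithmetic in this range, the round trip back to GB-18030 needs no table and loses nothing.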

    The editor could still use the Unicode rules for all valid GB18030
    sequences, and the invalid characters could then be represented, for
    example, with a colored or highlighted glyph such as <U+110000>. As neither
    the input nor the output is a Unicode encoding scheme, I don't think this
    invalidates Unicode conformance: the behavior would simply conform to
    GB18030 or other legacy GB PUA mappings.

    Of course, this editor will not be able to work on this text if its internal
    encoding form is UTF-16, unless it uses additional internal markup or
    storage for the GB sequences that were mapped in the edit buffer to a
    0xFFFD UTF-16 code unit. This "augmented text", with annotated values for
    each U+FFFD present in the text, would then not be handled as if it were
    only Unicode plain text; it would constitute what Unicode calls a
    higher-level protocol, used to keep the original code sequences that come
    from a non-Unicode charset encoding and have no clear equivalent in Unicode.
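
The annotated U+FFFD idea might be sketched as follows in Python. This is only an illustration under several assumptions: the function names are hypothetical, annotations are keyed by character position in a Python string rather than by UTF-16 code unit, and the byte scanner is deliberately simplified (it relies on Python's built-in "gb18030" codec to reject anything it cannot map). A real edit buffer would also have to move the annotations whenever text is inserted or deleted.

```python
# Sketch only: keep undecodable GB18030 sequences as out-of-band
# annotations attached to U+FFFD placeholders. Names are hypothetical.

REPLACEMENT = "\ufffd"


def decode_with_annotations(data: bytes):
    """Decode GB18030 bytes, returning (text, notes).

    notes maps the position of each placeholder U+FFFD in text to the
    original bytes that could not be mapped to a Unicode character.
    """
    chars = []
    notes = {}
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x80:
            length = 1                      # single-byte code
        elif i + 1 < len(data) and 0x30 <= data[i + 1] <= 0x39:
            length = 4                      # four-byte code
        else:
            length = 2                      # two-byte code
        chunk = data[i:i + length]
        try:
            ch = chunk.decode("gb18030")
        except UnicodeDecodeError:
            notes[len(chars)] = chunk       # remember the original bytes
            ch = REPLACEMENT
        chars.append(ch)
        i += length
    return "".join(chars), notes


def encode_with_annotations(text: str, notes: dict) -> bytes:
    """Re-encode to GB18030, restoring annotated sequences verbatim."""
    out = bytearray()
    for pos, ch in enumerate(text):
        if pos in notes:
            out += notes[pos]
        else:
            out += ch.encode("gb18030")
    return bytes(out)


original = "中文".encode("gb18030") + bytes.fromhex("E3329A36") + b"A"
text, notes = decode_with_annotations(original)
assert encode_with_annotations(text, notes) == original
```

A U+FFFD that is genuinely present in the source file carries no annotation, so it is re-encoded normally and stays distinguishable from the placeholders.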

    The same thing could be used, for example, to map the "Apple logo" registered
    character in files coded with MacRoman, instead of remapping it to a weakly
    interchangeable PUA: the out-of-band annotation of U+FFFD in the plain-text
    part of the edited file would keep track of the original encoding of this
    character, and the file may then be transmitted either in an altered form
    with a UTF, or by using some other text encapsulation format: for example an
    XML named entity (like "&apple-logo;") or a <char encoding="MacRoman"
    bytes="XX"/> element, or an <img> reference (in HTML files).
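
One possible shape for the <char> encapsulation, as a hedged Python sketch: the function name is hypothetical, the element syntax is only the one suggested above (not an existing schema), and the sketch assumes that any MacRoman byte which Python's "mac_roman" codec maps into the BMP private use area (U+E000..U+F8FF) should be encapsulated rather than emitted as a PUA character.

```python
# Sketch only: emit MacRoman text as XML character data, wrapping bytes
# that only map to the PUA in a hypothetical <char> element.
from xml.sax.saxutils import escape


def macroman_to_xml_text(data: bytes) -> str:
    """Convert MacRoman bytes to XML text with <char> encapsulation."""
    parts = []
    for byte in data:
        ch = bytes([byte]).decode("mac_roman")
        if "\ue000" <= ch <= "\uf8ff":
            # No interchangeable Unicode equivalent: keep the source byte.
            parts.append(f'<char encoding="MacRoman" bytes="{byte:02X}"/>')
        else:
            parts.append(escape(ch))        # escape &, < and > for XML
    return "".join(parts)


print(macroman_to_xml_text(b"Made by \xf0"))
```

A receiver that understands the element can restore the exact source byte, while one that does not still sees where the unmappable character stood.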

