RE: creating a test font w/ CJKV Extension B characters.

From: Philippe Verdy (
Date: Fri Nov 21 2003 - 11:58:09 EST


    From: Doug Ewell
    > Unless GB 18030 prohibits invalid sequences the way Unicode does, I
    > suppose there's no reason you couldn't map invalid GB 18030 sequences to
    > PUA code points *within the privacy of your own application* if you
    > really want to preserve them in some way, and have some idea what you
    > want to do with them. You MAY NOT map them to Unicode noncharacters or
    > anything outside the Unicode/10646 range (i.e. beyond U+10FFFD).

    I did not propose to use such a mapping externally. An application or
    system can use whatever internal encoding it thinks may be useful to
    handle legacy cases, even invalid ones, provided that this internal
    encoding is not used to create external data claiming to be Unicode.
    If that module preserves the invalid sequences that were present in
    its input, and provided that the input did not claim to be Unicode
    (which is the case for GB18030), I don't think it violates Unicode
    conformance, simply because there's no Unicode interface on this
    system.
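
    A minimal sketch of such an internal-only encoding, in Python and not
    part of the original discussion: invalid GB18030 bytes are kept as lone
    surrogates (U+DC00 plus the byte value), the same idea as Python's
    built-in "surrogateescape" error handler. The handler name, ESCAPE_BASE,
    decode_internal and encode_external are hypothetical names chosen for
    illustration, and the sketch assumes that valid GB18030 input never
    decodes to a lone surrogate.

```python
import codecs

# Hypothetical internal-only escape range: lone surrogates U+DC00..U+DCFF.
# Assumption: valid GB18030 input never decodes to these code points.
ESCAPE_BASE = 0xDC00


def _escape_invalid(err):
    """Decode error handler: keep every undecodable byte as an internal escape."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Resume decoding right after the bytes the codec reported as invalid.
    return "".join(chr(ESCAPE_BASE + b) for b in bad), err.end


codecs.register_error("gb18030_internal_escape", _escape_invalid)


def decode_internal(data: bytes) -> str:
    """GB18030 bytes -> internal string; invalid bytes become escapes."""
    return data.decode("gb18030", errors="gb18030_internal_escape")


def encode_external(text: str) -> bytes:
    """Internal string -> GB18030 bytes; escapes turn back into raw bytes."""
    out = bytearray()
    run = []  # pending ordinary characters, encoded together
    for ch in text:
        cp = ord(ch)
        if ESCAPE_BASE <= cp <= ESCAPE_BASE + 0xFF:
            out += "".join(run).encode("gb18030")
            run.clear()
            out.append(cp - ESCAPE_BASE)
        else:
            run.append(ch)
    out += "".join(run).encode("gb18030")
    return bytes(out)


if __name__ == "__main__":
    raw = "\u4e2d\u6587".encode("gb18030") + b"\xff" + b"ok"
    internal = decode_internal(raw)
    print(encode_external(internal) == raw)
```

    Because lone surrogates are not Unicode scalar values, Python refuses
    to serialize the internal string as UTF-8, so in this sketch the
    escapes cannot leak into data that claims to be Unicode.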

    Such a system could be built explicitly to conform only to GB18030,
    without claiming anything else about Unicode. The internal use of
    Unicode mappings for some sequences, and extra mappings for characters
    or sequences not in Unicode, is an internal decision that only
    influences the design of the implementation: Unicode in that case is
    used as a convenient tool to perform some things, but there's no
    required dependency. Using Unicode algorithms or mappings internally
    just eases the implementation of the other encoding.

    The solution that would map invalid sequences to Unicode PUA code
    points may have the problem of colliding with other valid PUA code
    points used in GB18030. These invalid sequences may also carry
    information that is not plain text for Unicode, such as markup or
    presentation elements. This does not violate the Unicode model, which
    encodes ONLY plain text and leaves other non-standard uses free for
    markup or upper-layer protocols.
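
    The collision concern can be checked with a short Python sketch (an
    illustration, not something taken from GB18030 itself): GB18030 can
    encode every Unicode scalar value, so any PUA code point an application
    picks for its escapes is also what some valid GB18030 sequence decodes
    to. The function name and the candidate list are hypothetical.

```python
def is_reachable_from_valid_gb18030(code_point: int) -> bool:
    """True if some valid GB18030 byte sequence decodes to this code point."""
    ch = chr(code_point)
    try:
        encoded = ch.encode("gb18030")
    except UnicodeEncodeError:
        return False
    return encoded.decode("gb18030") == ch


# Hypothetical supplementary-plane PUA code points an application might
# be tempted to reserve for escaped invalid bytes.
CANDIDATES = [0xF0000, 0x10FF00, 0x10FFFD]

collisions = [hex(cp) for cp in CANDIDATES if is_reachable_from_valid_gb18030(cp)]

if __name__ == "__main__":
    print(collisions)
```

    Every candidate that shows up in collisions can also arrive from
    well-formed input, so an escape stored there would be indistinguishable
    from a legitimate private-use character.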

    So my question remains: does GB18030 permanently bind out-of-range
    or invalid sequences to non-characters? If not, GB18030 applications
    may use them to encode something other than plain text, and there
    will be a need to map them to extra planes if the internal handling
    of text is best done with an extended Unicode encoding form like

    Another solution could be for GB18030 to mandate the mapping of invalid
    sequences to a well-defined set of Unicode PUAs. This would make them
    usable in UTF-16 encoding forms. But as no such mapping is defined for
    now, the question of the current assignment of GB18030 invalid
    sequences to non-characters remains open: is the mapping of GB18030
    to Unicode completely closed, or left open for further applications
    like markup (annotation or visual formatting and layout, font selection,
    text alternatives, semantic or syntactic data, pointers or links to
    associated information, images, custom bitmap glyphs, sets of character
    properties, phonetic variants...)?

    This archive was generated by hypermail 2.1.5 : Fri Nov 21 2003 - 12:47:03 EST