Re: creating a test font w/ CJKV Extension B characters.

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Nov 20 2003 - 17:20:20 EST


    From: "Michael (michka) Kaplan" <michka@trigeminal.com>
    > If you want to test gb18030 support, then please encode a web page in
    > gb18030 and test *that* in the browser of your choice.
    >
    > Now if you want to discuss NCR support then that may also be interesting.
    > But it would be nice to have tests that actually cover what they claim to
    > cover....

    Aren't NCRs supposed to contain ONLY a Unicode code point, even in
    GB18030-encoded pages?
    Testing a page with NCRs will only test Unicode support, not GB18030
    support, even if the Unicode code point in the NCR designates a character
    in the ideographic Plane 2...
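    For illustration, here is a small Python sketch of that point (the
    particular NCR value is just an example): the NCR is resolved to a Unicode
    code point before any charset question arises, so it exercises only the
    Unicode side.

    import html

    ncr = "&#x20000;"            # NCR for U+20000, an ideograph in Plane 2
    char = html.unescape(ncr)    # same result whatever the page's byte encoding
    assert ord(char) == 0x20000
    print(hex(ord(char)))        # 0x20000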

    To really test GB18030, you need to encode the page with it, without using
    NCRs.
    That is, you need to know the mapping tables between GB18030 code positions
    and Unicode code points, and implement the range tables for those GB18030
    code positions that are algorithmically mapped to Unicode.
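    As a sketch of the algorithmic part (in Python, just for illustration):
    the four-byte sequences from 90 30 81 30 through E3 32 9A 35 map linearly
    onto U+10000..U+10FFFF, so only the BMP portion still needs lookup tables.

    def gb18030_4byte_to_codepoint(b1, b2, b3, b4):
        # Linear index of the four-byte sequence within the four-byte space.
        linear = ((b1 - 0x81) * 10 + (b2 - 0x30)) * 1260 + (b3 - 0x81) * 10 + (b4 - 0x30)
        # 90 30 81 30 is the first sequence of the supplementary-plane range.
        base = ((0x90 - 0x81) * 10 + 0) * 1260
        return 0x10000 + (linear - base)

    cp = gb18030_4byte_to_codepoint(0x95, 0x32, 0x82, 0x36)
    assert cp == 0x20000
    # Cross-check against a codec implementation (Python ships one for gb18030):
    assert bytes([0x95, 0x32, 0x82, 0x36]).decode("gb18030") == chr(0x20000)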

    One subsidiary question.
    What is a browser supposed to do if it finds an out-of-range GB sequence
    that is NOT mapped to Unicode? Does GB18030 specify that these sequences
    are now "invalid" (permanently assigned to non-characters, like U+FFFF in
    Unicode), rather than "reserved" for future use (like "unassigned" code
    points in Unicode)?
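    As one data point (not what the GB18030 standard itself mandates): a
    four-byte sequence just past E3 32 9A 35, i.e. beyond anything mappable to
    Unicode, is treated as a plain decode error by at least one implementation,
    Python's gb18030 codec.

    bad = bytes([0xE3, 0x32, 0x9A, 0x36])   # one step past the last mapped sequence
    try:
        decoded = bad.decode("gb18030")
        print("decoded to:", [hex(ord(ch)) for ch in decoded])
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)
    print(bad.decode("gb18030", errors="replace"))   # undecodable bytes become U+FFFD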

    This is critical, because I fear that some future release of GB18030 may
    assign functions to these sequences that would be impossible to map onto
    Unicode, but only onto ISO/IEC 10646 "extra" planes. My worst fear is that
    these sequences could be used to define EUDC ideographs (end-user-defined
    characters), using some extra convention that encodes glyph forms (or
    sequences of strokes and layout info) and assigns them to a PUA, directly
    within a plain-text GB18030 document.

    The alternative would be to create a model for grapheme clusters adapted to
    Han ideographs, using ideographic description characters and assigning code
    points to the component strokes that make up an ideograph. It would then
    become possible to create a normative dictionary mapping every existing Han
    ideograph to its composed strokes (with the additional benefit that it
    would make stroke-based collation easier to implement, using the normative
    decomposition). This would also help unify new collections of ideographs
    and avoid duplicate assignments for ideographs that merit a distinct
    encoding as a single code point.
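    To make the idea concrete, here is a toy Python sketch using the
    ideographic description characters that already exist at U+2FF0..U+2FFB.
    The two decompositions below are only examples, not a normative dictionary;
    a real table of this kind is also what stroke-based collation could key on.

    IDC_LEFT_TO_RIGHT = "\u2ff0"   # ⿰

    # Hypothetical entries: encoded ideograph -> ideographic description sequence
    decompositions = {
        "\u597d": IDC_LEFT_TO_RIGHT + "\u5973\u5b50",   # 好 described as ⿰ 女 子
        "\u660e": IDC_LEFT_TO_RIGHT + "\u65e5\u6708",   # 明 described as ⿰ 日 月
    }

    def describe(text):
        """Replace ideographs that have a known decomposition by their description."""
        return "".join(decompositions.get(ch, ch) for ch in text)

    print(describe("\u597d\u660e"))   # ⿰女子⿰日月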


