Codepoint Differentiation

Date: Mon Feb 21 2005 - 06:38:52 CST


    I've been pondering the concept of using some kind of "differentiators" to
    define sub-meanings for codepoints. I see some discussion in the list archives
    of Variation Selectors (which I was thinking of). But it sounds like there are
    some problems with using them (or at least with combining codepoints). I'm
    afraid the technical details are beyond me.

    So what is the current status with this subject?

    Is there actually any problem with using Variation Selectors as-is to
    differentiate non-combining characters -- such as these applications:

     - Serbian Cyrillic Small "t"
     - Coptic letterforms for Greek letter codepoints
     - complete Archaic Greek and Asia Minor scripts aligned to Greek letter codepoints
     - several functions for two-directional case change with German Sharp S
     - alternate CJK ideographs and syllabographs

    Also, is it possible to redefine the behavior of Variation Selectors so that
    they could be used with combining marks, or create a new class of
    "differentiator" codepoints that could be used with combining marks instead?
    Some applications for this:

     - umlaut vs. diaeresis
     - "low acute" vs. "high acute"
     - Greek circumflex (perispomeni) in Greece vs. West
     - Greek capital letters with subscript (Greece) vs. adscript (West)
     - alternate Indic ligatures

    I can elaborate on most of these points on request, especially umlaut vs.
    diaeresis, which you may think has been solved with the CGJ but which still
    has vital problems.
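
    For reference, here is a minimal sketch of the CGJ mechanism as I understand
    it. It assumes the convention that base + CGJ (U+034F) + COMBINING DIAERESIS
    (U+0308) marks a diaeresis as distinct from an umlaut; the variable names are
    just for illustration. It only shows that the CGJ keeps the sequence from
    composing under normalization, not that the distinction is honored elsewhere.

```python
import unicodedata

CGJ = "\u034F"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

# Plain sequence: letter + combining diaeresis (read here as an umlaut).
umlaut = "a" + DIAERESIS
# Assumed convention: the CGJ in between marks this one as a diaeresis.
diaeresis = "a" + CGJ + DIAERESIS

# NFC composes the plain sequence into the precomposed letter U+00E4 ...
nfc_umlaut = unicodedata.normalize("NFC", umlaut)
# ... but leaves the CGJ sequence alone, because the CGJ (combining class 0)
# sits between the letter and the mark and blocks composition.
nfc_diaeresis = unicodedata.normalize("NFC", diaeresis)

print([hex(ord(c)) for c in nfc_umlaut])
print([hex(ord(c)) for c in nfc_diaeresis])
```

    So the two spellings stay distinct after normalization, but whether a font,
    keyboard map or sort routine does anything with the difference is another
    matter.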

    In all cases I think it's essential that whatever is done is an official,
    mandatory assignment, visually and textually documented in the main glyph
    charts. The whole point of this should be that every smart font and keyboard
    map in the world reliably implements the system as a standard.

    As I say, I don't really understand the technical issues of decomposition and
    sorting and so forth, but this seems to be a fairly straightforward concept:

     - all differentiators are placed after the thing (letter, mark) they modify,
    and are only a characteristic of that thing, containing no information on the
    relationship of the thing to anything else

     - Unicode can add a behavior definition for a specific assigned combination
    of thing + differentiator which all processing systems should implement

     - otherwise without a specific behavior definition for the combination, most
    processing just ignores a differentiator

     - a smart font though may always detect pairs of codepoint + differentiator
    and take some action

     - also a specialized database can choose to take a specific codepoint +
    differentiator pair into account for sorting, searching or some other purpose

     - for precomposed characters, specific precomposed characters +
    differentiator combinations can be assigned specific decomposition rules that
    define how the decomposed letter + mark(s) + differentiator(s) end up

     - or if that's too complicated to implement, precomposed characters can just
    be excluded from use with differentiators (we'll all be switching to pure
    combining Unicode soon anyway, right?)
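
    To make the idea concrete, here is a rough sketch of the first few points,
    using the existing Variation Selector ranges as the "differentiators". The
    function names and the REGISTERED table are hypothetical, and the pair of
    Cyrillic small "t" + VS1 is only an illustration, not an actual Unicode
    assignment.

```python
def is_differentiator(ch: str) -> bool:
    """True for VS1-VS16 (U+FE00..U+FE0F) and VS17-VS256 (U+E0100..U+E01EF)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def pair_up(text: str) -> list:
    """Split text into (thing, differentiators) pairs.

    A differentiator attaches to the codepoint immediately before it and says
    nothing about any other codepoint. One with nothing before it is kept as
    an item of its own.
    """
    pairs = []
    for ch in text:
        if is_differentiator(ch) and pairs:
            base, diffs = pairs[-1]
            pairs[-1] = (base, diffs + ch)
        else:
            pairs.append((ch, ""))
    return pairs


# Hypothetical table of combinations that have a defined behavior.
REGISTERED = {("\u0442", "\uFE00")}  # illustrative only: Cyrillic small te + VS1


def sort_key(text: str, honor_registered: bool = False) -> list:
    """Build a comparison key.

    Ordinary processing ignores every differentiator. A specialized database
    can pass honor_registered=True to keep the registered pairs distinct.
    """
    key = []
    for base, diffs in pair_up(text):
        if honor_registered and (base, diffs) in REGISTERED:
            key.append((base, diffs))
        else:
            key.append((base, ""))
    return key


plain = "\u0442"
marked = "\u0442\uFE00"
print(sort_key(plain) == sort_key(marked))
print(sort_key(plain, True) == sort_key(marked, True))
```

    The two strings compare equal in ordinary processing and differ only when
    the caller opts in, which is the behavior I have in mind. A smart font would
    do the equivalent of pair_up() and pick a glyph per pair.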

    Thanks for your education and feedback on this subject.

