character "combinability"

From: spir (
Date: Thu Feb 18 2010 - 08:15:07 CST



    Does Unicode specify which characters, especially bases (*), are allowed to take part in a combination (that is, a combining sequence)? For instance, within the ASCII subset, it seems to me that only letters can occur in a combination, except for the special case of CR-LF. But I could not find any such list of restrictions. There may be two cases, imo:

    -1- Either Unicode does not impose any restriction on combination. But then we can, and are allowed to, concretely encode characters (or rather graphemes) that have no attested existence in real use: for instance, (ASTERISK, COMBINING CIRCUMFLEX). This seems to me to contradict Unicode guidelines, I guess. But it opens the door to creative use of Unicode ;-)

    -2- Or there are such restrictions. This data should then specify not only which characters can combine at all, but also which classes of combining marks they are allowed to combine with. Signs like '*' probably cannot combine at all. ASCII letters can only combine with a given class of diacritics. Actually, this is the case for Hangul syllables.

    Or is there a kind of implicit gentleman's agreement, meaning combinations should be used in a sensible manner?
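
    Case -1- can at least be probed empirically. Below is a minimal sketch using Python's standard unicodedata module; it only shows how one implementation treats the (ASTERISK, COMBINING CIRCUMFLEX) sequence and says nothing about what the standard recommends. The helper name describe is mine, purely for illustration.

```python
import unicodedata

ASTERISK = "\u002A"              # '*', a punctuation character
COMBINING_CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT, a nonspacing mark

seq = ASTERISK + COMBINING_CIRCUMFLEX


def describe(text):
    """Return (name, general category, canonical combining class) per code point."""
    return [
        (unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch))
        for ch in text
    ]


info = describe(seq)

# Normalization neither rejects nor alters the sequence: no precomposed
# "asterisk with circumflex" exists, so both code points are kept as they are.
nfc = unicodedata.normalize("NFC", seq)
nfd = unicodedata.normalize("NFD", seq)

# Contrast with a letter base, for which a precomposed character exists.
e_circ = unicodedata.normalize("NFC", "e" + COMBINING_CIRCUMFLEX)

for name, category, ccc in info:
    print(name, category, ccc)
print(nfc == seq, nfd == seq, e_circ == "\u00EA")
```

    At least in this library, the character data alone gives no sign that the sequence is forbidden: the mark has a combining class, the asterisk has none, and normalization passes the pair through unchanged.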

    (Where can I find accurate information on this topic?)


    (*) For instance, the algorithm for grouping characters into "grapheme clusters" specifies that "extend codes" always be grouped with the previous code. This seems to allow any "Combining Mark" to be placed arbitrarily on any character (even a non-base one, actually).
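
    To make the footnote concrete, here is a rough sketch of that grouping rule. It is not the full grapheme cluster algorithm: as a simplifying assumption it treats general categories Mn and Me as the "extend" class and category Cc as the controls, and it handles only the CR-LF rule and the "do not break before an extend code" rule. The function names are hypothetical.

```python
import unicodedata


def is_extend_like(ch):
    # Assumption: approximate the "extend" class by nonspacing (Mn) and
    # enclosing (Me) marks; the real property is defined in its own data file.
    return unicodedata.category(ch) in ("Mn", "Me")


def is_control_like(ch):
    # Assumption: approximate controls by general category Cc (includes CR, LF).
    return unicodedata.category(ch) == "Cc"


def rough_clusters(text):
    """Group code points into approximate grapheme clusters."""
    clusters = []
    for ch in text:
        if clusters and clusters[-1] == "\r" and ch == "\n":
            # CR followed by LF stays together.
            clusters[-1] += ch
        elif clusters and is_extend_like(ch) and not is_control_like(clusters[-1][-1]):
            # An extend code joins whatever precedes it, base letter or not.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


sample = "*\u0302a\r\ne\u0301\u0302"
print([[f"U+{ord(c):04X}" for c in cluster] for cluster in rough_clusters(sample)])
```

    Note that the rule looks only at the class of the mark, never at what kind of character precedes it (controls aside), so the asterisk picks up the circumflex exactly as a letter would.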

    la vita e estrany

    This archive was generated by hypermail 2.1.5 : Thu Feb 18 2010 - 08:18:14 CST