Re: script complexity, was Re: OpenType vs TrueType

From: Philippe Verdy (
Date: Sun Dec 05 2004 - 10:04:25 CST


    > Richard Cook <rscook at socrates dot berkeley dot edu> wrote:
    >> Script complexity is not so easily quantified. Has anyone tried to
    >> sort scripts by complexity? In terms of the present discussion, Han
    >> would be viewed as a simple script, and yet it is "simple" only in
    >> terms of the script model in which ideographs are the smallest unit.
    >> In a stroke-based Han script model, Han is at least as complex as any.

    If Han had not been encoded with an ideograph-based model, maybe we would
    have needed far fewer code points. However, the main immediate problem would
    have been that the layout of composite radicals and strokes in the
    ideographic square is very complex, highly contextual, and in fact too
    variable across dialects and script forms to allow a layout algorithm to be
    designed and standardized.

    One could at least have standardized a Han strokes-to-square layout system,
    but it would have required a huge dictionary with many
    dialect-specific sections to handle the variant forms and placement of the
    component strokes. In addition, the "square" model is not imperative in Han,
    because there are various writing styles in which the usual square
    model is much relaxed, or simply not observed in actual documents.

    To model such variations in a stroke-based model, it would have been
    necessary to encode:
    - the strokes themselves (all, not just the radicals!)
    - stroke variants
    - descriptive composition pseudo-characters (like the existing IDC in
    - dialectal composition rules.
    It would then have been necessary to create a very complex specification
    describing each ideograph according to this model, so that a renderer could
    redraw the ideographs from such composition grapheme clusters.

    The second problem is that the GB* and Big5 encodings already existed as
    widely used standards, whereas there was no concrete and interoperable
    solution for representing Han characters with such composed sequences.
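
    To give an idea of what reading such a composed sequence involves, here is
    a minimal Python sketch that parses a prefix-notation description built
    from the twelve Ideographic Description Characters at U+2FF0..U+2FFB. The
    function names are hypothetical, and the sketch only recovers the nesting
    structure; it says nothing about the contextual layout problem described
    above, which is the hard part.

```python
# The two IDCs that take three operands; the other ten in
# U+2FF0..U+2FFB take two.
TERNARY_IDCS = {"\u2FF2", "\u2FF3"}


def idc_arity(ch):
    """Return the operand count of an IDC, or 0 for any other character."""
    if 0x2FF0 <= ord(ch) <= 0x2FFB:
        return 3 if ch in TERNARY_IDCS else 2
    return 0


def parse_ids(seq, pos=0):
    """Parse one description starting at pos; return (tree, next_pos)."""
    if pos >= len(seq):
        raise ValueError("truncated description sequence")
    ch = seq[pos]
    pos += 1
    arity = idc_arity(ch)
    if arity == 0:
        return ch, pos  # a plain component (radical, stroke, ideograph)
    parts = []
    for _ in range(arity):
        part, pos = parse_ids(seq, pos)
        parts.append(part)
    return (ch, *parts), pos


def parse_full(seq):
    """Parse a whole sequence and reject trailing characters."""
    tree, pos = parse_ids(seq)
    if pos != len(seq):
        raise ValueError("unexpected characters after description")
    return tree


# Left-to-right composition of U+5973 and U+5B50.
print(parse_full("\u2FF0\u5973\u5B50"))
```

    A nested input such as U+2FF1 U+8279 U+2FF0 U+65E5 U+6708 yields a tree
    whose second operand is itself a composition. A renderer would still need
    per-style and per-dialect rules to turn that tree into a glyph.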

    This modeling was possible for Hangul, but with a simplification: the
    encoded "jamos" sometimes represent several "strokes" (considered as letters,
    also because they have a clear phonetic value, but sometimes grouped within
    the same "jamo" to simplify the design of the Hangul layout system, notably
    for the double-consonant "SSANG*" jamos). But a simpler system of jamos was
    still possible (for example, it was easy to model the double-consonant jamos
    as two successive simpler jamos, and then update the Hangul syllable model
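
    The idea of modeling double-consonant jamos as two successive simpler
    jamos can be sketched in Python as follows. The mapping table and the
    function name are hypothetical and cover only the modern SSANG* jamos;
    the resulting sequences are not equivalent to the encoded jamos under
    Unicode normalization, so this only illustrates the alternative model.

```python
import unicodedata

# Hypothetical table: each double-consonant jamo mapped to two simple jamos.
SPLIT_DOUBLE = {
    "\u1101": "\u1100\u1100",  # choseong SSANGKIYEOK -> KIYEOK KIYEOK
    "\u1104": "\u1103\u1103",  # choseong SSANGTIKEUT -> TIKEUT TIKEUT
    "\u1108": "\u1107\u1107",  # choseong SSANGPIEUP -> PIEUP PIEUP
    "\u110A": "\u1109\u1109",  # choseong SSANGSIOS -> SIOS SIOS
    "\u110D": "\u110C\u110C",  # choseong SSANGCIEUC -> CIEUC CIEUC
    "\u11A9": "\u11A8\u11A8",  # jongseong SSANGKIYEOK -> KIYEOK KIYEOK
    "\u11BB": "\u11BA\u11BA",  # jongseong SSANGSIOS -> SIOS SIOS
}


def split_double_jamos(text):
    """Decompose syllables to jamos (NFD), then split the SSANG* jamos."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(SPLIT_DOUBLE.get(ch, ch) for ch in decomposed)


# U+AE4C decomposes to U+1101 U+1161; the leading jamo is then split.
result = split_double_jamos("\uAE4C")
print([f"U+{ord(ch):04X}" for ch in result])
```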

    This archive was generated by hypermail 2.1.5 : Sun Dec 05 2004 - 10:10:20 CST