Re: Why people still want to encode precomposed letters

From: philip chastney (
Date: Mon Nov 24 2008 - 04:17:05 CST


    --- On Sun, 23/11/08, Karl Pentzlin <> wrote:

    From: Karl Pentzlin <>
    Subject: Re: Why people still want to encode precomposed letters
    Cc: "Unicode Mailing List" <>
    Date: Sunday, 23 November, 2008, 10:48 PM

    On Sunday, 23 November 2008 at 22:01, philip chastney wrote:

    pc> A couple of quick questions. First, about how long would the list of
    pc> combinations be?
    pc> if we take 32-ish Latin characters, 24 Greek and 36-ish Cyrillic
    pc> characters, and double that for upper and lower case, we have 144
    pc> potential base characters
    pc> Combining Diacritical Marks (0300~036F) lists 112 characters
    pc> ...
    pc> we can refine that figure
    pc> Latin characters use about 40 marks, Greek perhaps half-a-dozen
    pc> (if we count the cases where 2 marks are used) and Cyrillic about 12
    pc> ( 32 × 40 ) + ( 24 × 6 ) + ( 32 × 12 ) = 1808 potential
    pc> combinations per case, which gives us a tighter limit of 3,600

    If you take into account that:
    - a lot of people (e.g. linguists and writers of North American indigenous
      languages) commonly attach 3 diacritical marks to a base letter,
    - there are "double diacritics" which attach to arbitrary pairs of
      base letters,
    - there possibly will be "triple diacritics" which attach to
      triplets of base letters,
    this number gets somewhat higher.

    not at all
    the figure of ( 32 × 40 ) for Latin lowercase is an upper limit  --  i.e., it overstates the likely requirement
    where information is sparse, the technique is to set upper and lower limits and try and refine them, to see how close you can get them
    in this case, ( 32 × 40 ) twice  =  2560  --  that's an upper limit
    the number of Latin-based combinations already included in TUS is 500~600  --  that's a lower limit
    note that the lower limit is approximately 20~25% of the upper limit  --  i.e., they are within a decimal order of magnitude
    in this case, the number of double and triple diacritics found in North American indigenous languages could be 3× the number of composites already included in TUS, without busting that upper limit  --  I think that limit is safe
    note that the double diacritics found in Vietnamese are already included in the 500~600 figure
    you could incorporate an allowance for double and triple diacritics into that first WAG, but I really don't see the point  --  it gives you no useful information
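For anyone who wants to check the bounds argument, here is a minimal sketch of the same arithmetic in Python. It assumes only the figures given above (an upper limit of 32 × 40 per case, and 500~600 Latin-based composites already in TUS); the variable and function names are hypothetical.

```python
# Upper limit: 32 Latin letters x 40 marks, for both cases.
upper = 2 * 32 * 40

# Lower limit: Latin-based composites already encoded (a range, not a count).
lower_min, lower_max = 500, 600

# How close are the two limits?
ratio_min = lower_min / upper
ratio_max = lower_max / upper


def fits_under_upper(extra_multiple, existing, limit):
    """True if existing composites plus extra_multiple times as many
    new ones still stay within the limit."""
    return existing + extra_multiple * existing <= limit


print(upper)                                    # 2560
print(round(ratio_min, 3), round(ratio_max, 3))
print(fits_under_upper(3, lower_max, upper))    # True: 600 + 1800 = 2400
```

Even on the stricter reading, where the new combinations come on top of the existing 600, tripling them still leaves room under 2560.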

    This archive was generated by hypermail 2.1.5 : Mon Nov 24 2008 - 04:20:25 CST