Character:Glyph Ratios

From: James E. Agenbroad (jage@loc.gov)
Date: Fri Oct 15 1999 - 15:53:33 EDT


                                            Friday, October 15, 1999
The following statistics are an attempt to document the 1:several
character:glyph ratios mentioned in my earlier note today. The character
counts are from the published Unicode 2.0 except for Khmer, Myanmar and
Sinhala which are from the May 19, 1999 draft of version 3.0. The glyph
counts are from the glyph repertoires in Monotype's "Worldtype Solutions
Catalogue" obtained at the fall 1998 IUC. For Arabic the presentation
forms from U+FB50 to U+FDFB and from U+FE70 to U+FEFC should be
sufficient evidence. The glyph counts include some stylistic variants,
e.g., the Delhi and Bombay forms of some digits, but the overall
1:several ratio seems clear. For Tibetan I suspect that the difference
would be even greater but have no glyph data to confirm it. The recent
message about Charles Wikner's awesome list of 1,046 Sanskrit conjunct
consonants is interesting for testing Devanagari rendering software, but
not directly relevant as many (possibly all) of them can be synthesized
from a smaller number of the glyphs for their constituent parts. The
inclusion of stylistic variants for some of the 1,046 is also impressive.
(With necessary changes it could probably be used to test texts in other
Indic scripts too.)

  Script      Begins at  Characters  Glyphs

  Devanagari  U+0901            104     374
  Bengali     U+0981             87     316
  Gurmukhi    U+0A02             75     112
  Gujarati    U+0A81             78     371
  Khmer       U+1780            103     177
  Oriya       U+0B01             79     225
  Tamil       U+0B82             61     166
  Telugu      U+0C01             80     234
  Kannada     U+0C82             80     241
  Malayalam   U+0D02             78     404
  Thai        U+0E01             87      88
  Lao         U+0E81             65     157
  Sinhala     U+0D82             80     387
  Myanmar     U+1000             78     226

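Mathematics is far from my favorite subject so this may have some errors,
but the trend seems clear: except for Gurmukhi, Khmer and Thai, these
scripts have a character:glyph ratio of at least about 1:2.4, and for six
of them (Devanagari, Bengali, Gujarati, Kannada, Malayalam and Sinhala) it
is 1:3 or more.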
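
For anyone who wants to check the arithmetic, here is a minimal sketch in
Python that computes glyphs per character for each script. The counts are
transcribed from the table above; the names COUNTS and
glyphs_per_character are only illustrative, and the result is no better
than the glyph counts themselves (which include some stylistic variants).

```python
# Counts transcribed from the table above: script -> (characters, glyphs).
COUNTS = {
    "Devanagari": (104, 374),
    "Bengali": (87, 316),
    "Gurmukhi": (75, 112),
    "Gujarati": (78, 371),
    "Khmer": (103, 177),
    "Oriya": (79, 225),
    "Tamil": (61, 166),
    "Telugu": (80, 234),
    "Kannada": (80, 241),
    "Malayalam": (78, 404),
    "Thai": (87, 88),
    "Lao": (65, 157),
    "Sinhala": (80, 387),
    "Myanmar": (78, 226),
}


def glyphs_per_character(counts):
    """Return {script: glyphs / characters} for every script in counts."""
    return {script: glyphs / chars for script, (chars, glyphs) in counts.items()}


ratios = glyphs_per_character(COUNTS)

# List the scripts from the highest glyphs-per-character figure to the lowest,
# written in the same 1:N character:glyph form used in the text.
for script, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{script:<11} 1:{ratio:.2f}")
```

Sorting by the computed figure makes it easy to see which scripts sit well
below the rest and which sit well above.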
     I'll be away next week and unable to respond to queries, comments or
corrections.
     Regards,
          Jim Agenbroad ( jage@LOC.gov )
     The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.


