Wednesday, May 30, 2001
Attached is a note I wrote in September 1993 about the ratio of characters
to glyphs in several Indic scripts. Much has changed on the Unicode
front since then, but I think the need for rendering software to decide
which of many glyphs to use to represent a given sequence of codes is
still with us. A similar situation obtains with Arabic--unless one
requires the use of Arabic presentation forms. If one excludes the
combining characters at U+0300 to U+0362, European scripts tend to have a 1:1
character to glyph ratio; Chinese, Japanese and (maybe) Korean scripts
also tend to have a 1:1 character to glyph ratio. But most scripts
between Europe and the Far East--Arabic, South and Southeast Asian ones--do
not. Unless the rendering software and the fonts are in sync the results
will be unsatisfactory. A few postings on the 'single font' discussion
have mentioned this but I hope some data may be helpful.
The story goes that back in Ancient Greece (I think) someone was
describing Utopia and a listener asked, "But who will do the work?" and
the reply was, "Oh, we will have slaves." The computer now can be an
effective slave when given explicit instructions, but without consistent
instructions the results will not be satisfactory.
This may be beyond the scope of Unicode which aims to unambiguously
encode text for the computer (and succeeds) but does not dwell on details
of its input or output--rendering it legible for humans to read.
Jim Agenbroad ( jage@LOC.gov )
The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.
---------- Forwarded message ----------
Date: Fri, 10 Sep 93 14:12:07 -0400
From: jage (James E. Agenbroad)
Subject: Some Character to Glyph Statistics
Friday, September 10, 1993
Recent Internet discussions about fonts for ISO10646/Unicode prompted
me to do some counting. The data are suggestive rather than definitive
at least in part because the counts of glyphs are based on only a single
source and it may not be up to date. They do suggest that for various
writing systems of South (and maybe Southeast) Asia based on Indic scripts
the ratio of coded characters to glyphs is not 1:1 but 1:2 or even 1:3.
I'm sure this is no surprise to you but the Internet discussions make no
mention of it so I thought I would. When a writing system has more glyphs
than characters I think there must be software to decide when which glyph
is wanted. (This software may also need to know something about the
target device but that's not an issue I can shed any light on.)
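To make the idea of glyph-selecting software concrete, here is a minimal
sketch in Python. It is only an illustration under stated assumptions: the
glyph names and the tiny table are hypothetical, not taken from any real
font, and real Indic rendering also involves reordering and context rules
that this greedy longest-match lookup ignores. The code points used are
U+0915 DEVANAGARI LETTER KA, U+094D DEVANAGARI SIGN VIRAMA and U+0937
DEVANAGARI LETTER SSA.

```python
# Hypothetical glyph table: each key is a sequence of character codes,
# each value is a made-up glyph name. A real font would hold hundreds.
GLYPHS = {
    "\u0915": "ka",
    "\u0937": "ssa",
    "\u094D": "virama",
    "\u0915\u094D": "ka_half",
    "\u0915\u094D\u0937": "k_ssa_conjunct",
}


def select_glyphs(text, table=GLYPHS):
    """Map a string of character codes to glyph names, longest match first."""
    longest = max(len(key) for key in table)
    glyphs = []
    i = 0
    while i < len(text):
        # Try the longest candidate sequence first, then shorter ones.
        for n in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                glyphs.append(table[chunk])
                i += n
                break
        else:
            # No glyph known for this code: emit a placeholder and move on.
            glyphs.append(".notdef")
            i += 1
    return glyphs


# Three character codes, one glyph:
print(select_glyphs("\u0915\u094D\u0937"))   # ['k_ssa_conjunct']
# Two character codes, two glyphs:
print(select_glyphs("\u0915\u0937"))         # ['ka', 'ssa']
```

The point of the sketch is only that the same character code (KA here) can
come out as different glyphs depending on its neighbours, so the table of
glyphs and the selection logic have to agree with each other.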
As a preliminary assessment I have counted the number of character
codes ISO 10646 assigns for several writing systems and the number of
glyphs from synopses of the same writing systems as found in "Specimen
book of 'Monotype' non-Latin faces" issued loose-leaf by the Monotype
Corporation. I give the number and date of each sheet. In counting
I have omitted western style punctuation and numerals.
Writing system  Sheet no., date    10646   Mono.    Rough
                                   chars   glyphs   ratio
Bengali         470, 5/65            89     331     1:3
Burmese         558, 5/64            76     213     1:3
Devanagari      155, 8/75           104     248     1:2.5
Gujarati        460, 7/71            75     232     1:3
Gurmukhi        601, 9/74            74     146     1:2
Kannada         588, 9/69            80     236     1:3
Malayalam       590, 7/75            78     590     1:7
Oriya           706, 3/70            78     371     1:4
Sinhalese       557, 1/64            90     348     1:3.5
Tamil           280, 1/64            61     171     1:3
Telugu          626, 3/71            80     312     1:4
Thai            577, 4/74            92     208     1:2
Tibetan         (Van Ostermann)      80     158     1:2
For Sinhalese and Tibetan (not in 10646 yet) the character count is from
Unicode Technical Report no. 2. For Devanagari and Gurmukhi the Monotype
sheet has a note: "A special mould is required for these matrices". The
relation of these fonts to current systems is unclear. As noted, my
Monotype book does not include Tibetan; the glyphs are from George Van
Ostermann's "Manual of foreign languages" 4th ed. 1952--I counted the
letters, ligatures, numerals, vowel signs and punctuation.
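For anyone who wants to redo the arithmetic, the following Python sketch
recomputes the glyphs-per-character figure from the counts in the table
above. The names COUNTS and glyphs_per_char are mine, chosen for
illustration. The exact quotients will not always match the "rough ratio"
column, which was rounded by eye and is meant only as a guide.

```python
# (10646 character count, Monotype glyph count) as given in the table.
# Tibetan glyphs come from Van Ostermann rather than Monotype.
COUNTS = {
    "Bengali": (89, 331),
    "Burmese": (76, 213),
    "Devanagari": (104, 248),
    "Gujarati": (75, 232),
    "Gurmukhi": (74, 146),
    "Kannada": (80, 236),
    "Malayalam": (78, 590),
    "Oriya": (78, 371),
    "Sinhalese": (90, 348),
    "Tamil": (61, 171),
    "Telugu": (80, 312),
    "Thai": (92, 208),
    "Tibetan": (80, 158),
}


def glyphs_per_char(chars, glyphs):
    """Return how many glyphs there are for each coded character."""
    return glyphs / chars


# Print one line per writing system, with the ratio to one decimal place.
for name, (chars, glyphs) in COUNTS.items():
    ratio = glyphs_per_char(chars, glyphs)
    print(f"{name:<12}{chars:>5}{glyphs:>7}   1:{ratio:.1f}")
```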
I would also like to express my agreement with the man from New South Wales
who said libraries will need to display lots of different characters. I
do not know if this means one large font or many, so long as they are
all available when needed to display a string of character codes--without
the recipient knowing what will be needed and taking extra measures to
load the proper font. The fonts for such purposes would not need to have
extremely high resolution, maybe 24 dots high per character.
Feel free to forward this to others so long as the cautionary note is
included--this is a purely personal opinion, not an official policy of
any government or agency of any.
Jim Agenbroad (email@example.com )