RE: A basic question on encoding Latin characters

From: Marco.Cimarosti@icl.com
Date: Fri Sep 24 1999 - 08:59:32 EDT


Hallo.

Maybe I could be that layperson you are talking about (though I don't know
how intelligent).

I probably understood just a fraction of what is being discussed;
nevertheless, I wish to ask an actual layperson's question that I have been
wondering about for the last 8 years, and have experts check my answers.

Real laypersons always start their questions with "Why"; in fact, the
question is:

        Why were all those pre-composed characters encoded in the first
place?

I have given myself a few answers; many answers are in the Unicode 1.0/2.0
docs, and many more may be found on this list. None of them, however, is
very convincing (but I have already experienced that, if I insist, someone
on the list will explain things better and eventually convince me):

        * Compatibility with existing standards?
What does that mean exactly? If Unicode can encode, in one way or another,
the same text that was represented by my old standard, then it is
compatible, no?
Say that, in my old ANSI Latin 1 text file, I have an "á" (0xE1, that is, an
"a" with a funny comma-like thing floating upon it). Well, if my new Unicode
file has a method (whatever method) to show me an "a" with the same funny
thing floating on it, I am quite happy, as a user, and I would say that the
new file type is nicely "compatible".
So, this is no explanation.

        * Round-trip conversion to/from existing standards?
Well, I am a layman, but a programmer for a living. Imagine they ask me to
write a utility to convert, say, Latin 1 to/from
Unicode-without-precomposed-characters, and that it has to be as
round-trip-clever as possible.
It is not such a nightmare for me to code:
        U+0061 U+0301 ==> 0xE1
        0xE1 ==> U+0061 U+0301
Conversely, if I have to convert Latin 1 to/from Unicode-as-it-actually-is,
I have to code:
        U+0061 U+0301 ==> 0xE1
        U+00E1 ==> 0xE1
        0xE1 ==> U+0061 U+0301 {or will my boss prefer U+00E1?}
So, having pre-composed AND combining characters does not help round-trip
conversions: it makes them harder, or even impossible.
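
The round-trip logic above can be sketched in a few lines of modern Python.
This is only an illustration of the argument, not 1999-era tooling: it
assumes today's standard unicodedata module, where NFD decomposes
precomposed characters and NFC recomposes them.

```python
# A minimal sketch of the round-trip conversion discussed above,
# using Python's standard unicodedata module (a modern-tooling
# assumption, not something available to a 1999 converter).
import unicodedata

def latin1_to_decomposed(data: bytes) -> str:
    # Decode Latin 1, then decompose precomposed characters (NFD):
    # U+00E1 becomes U+0061 U+0301.
    return unicodedata.normalize("NFD", data.decode("latin-1"))

def to_latin1(text: str) -> bytes:
    # Recompose first (NFC), so both U+00E1 and the pair
    # U+0061 U+0301 map back to Latin 1 0xE1.
    return unicodedata.normalize("NFC", text).encode("latin-1")

decomposed = latin1_to_decomposed(b"\xe1")
assert decomposed == "a\u0301"            # "a" + combining acute
assert to_latin1(decomposed) == b"\xe1"   # round trip succeeds
assert to_latin1("\u00e1") == b"\xe1"     # precomposed form maps back too
```

Note how the converter must handle both the precomposed and the decomposed
spelling on the way back: exactly the extra mapping complained about above.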

        * Accented characters are already on existing keyboards?
Hey, my programmer's editor can insert a full C++ "if-else" statement
(including indentation and braces) with a single keystroke! It has already
been pointed out that there's no harm if the "è" key on my Italian keyboard
inserts an "e" followed by a combining grave accent.
Not a valid reason either, then.

        * Some old-minded representative of some standardization body was
too dim to understand that "á" is just an "a" wearing an acute-accent hat?
No, no. In my opinion, members of standardization bodies are brilliant
people. And if they were not, they would have been fired, sooner or later.

        * It was faster, for developers, to implement the first Unicode apps
with pre-composed characters?
As an exercise, I wrote a small Unicode editor for poor old MS-DOS that uses
combining characters nicely. If I could do it, anybody else can. In fact,
the very first Unicode-enabled application and font that I ever saw (Windows
NT's Notepad and Microsoft's Lucida Sans Unicode font) already supported
combining characters.

        * Accented-character glyphs are already in fonts and, thus, they
must have a code?
Hey, but glyphs are glyphs and characters are characters! Font-makers should
come up with their own glyph identification schemes. They should give a
unique ID of their choice to each glyph; then some sort of software
algorithm (possibly embedded in the font itself) will decide how to map one
or more characters to one or more glyphs.
This, too, is not a valid reason.

But, possibly, it deserves more thinking...

Imagine that you are a font architect (many of you don't need much
imagination for this ;-) and that you have to come up with a glyph
identification scheme. It is hard work, especially if you want to do a nice
job that need not be done again each time your foundry starts a new font
project.

The glyph ID can be absolutely arbitrary, but it could be a wise choice to
use Unicode values as glyph identifiers, whenever possible.

Most glyphs, in fact, correspond to a single Unicode character, so it is
handy to assign them the same numeric value: the glyph ID for "a" can be
0x61, for "b" can be 0x62, for the Hanzi ideograph "one" 0x4E00, etc.

There is only a minority of Unicode characters that require more than one
glyph. Devanagari "ra" (U+0930), for instance, interacts in complex ways
with virama and some vowel signs, so it requires several glyphs, even in the
most naive of fonts. (In such cases, you have to invent something; e.g.
0x010930 = nominal "ra", 0x020930 = subscript "-ra", 0x030930 = superscript
"r-", 0x040930 = "r" + vowel "u", etc.) Similarly, there are a few
characters that may use the same glyph in some fonts. A designer could
decide that, in his/her Latin/Greek/Cyrillic font, Latin "A", Cyrillic "A"
and Greek capital alpha correspond to the same glyph: 0x41.
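
The glyph-numbering scheme sketched in that paragraph can be written down as
a tiny helper. This is purely hypothetical, following the example IDs in the
text (a variant number shifted into the high bytes, the code point in the
low ones):

```python
# Hypothetical glyph-ID scheme from the text: most glyphs reuse the
# Unicode code point directly; characters that need several glyphs
# (like Devanagari RA) get a variant number in the high bytes.
def glyph_id(codepoint: int, variant: int = 0) -> int:
    # variant 0 = the plain "same as Unicode" case;
    # variants 1+ = font-specific shapes of the same character.
    return (variant << 16) | codepoint

assert glyph_id(0x61) == 0x61              # "a" keeps its code point
assert glyph_id(0x4E00) == 0x4E00          # Hanzi "one" keeps its code point
assert glyph_id(0x0930, 1) == 0x010930     # nominal Devanagari "ra"
assert glyph_id(0x0930, 2) == 0x020930     # subscript "-ra"
```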

The credits page of every version of the Unicode book shows an impressive
number of font designers and vendors. My impression is that all these
font-people who took part in the design of Unicode had in mind, from the
beginning, to use Unicode also as a glyph encoding inside their fonts.

In my mind, all those letter-accent pairs, all those ligatures, all those
"presentation forms" for Arabic and vertical CJK, all those ideograph
variants, etc. are there to allow font designers to use Unicode as a glyph
indexing system.

Probably I am wrong in thinking this. But, assuming I have the right
impression, what would be wrong with it?

If these precomposed characters (or arbitrarily decomposed ones, such as
U+0BD7 TAMIL AU LENGTH MARK) are actually needed by font designers, why not
encode them? Are font designers less important than, say, phoneticians or
chess players?

Fonts are important, and constitute one of the key technologies that (will)
allow developers to support Unicode. Unicode may be as abstract as you like,
but the reality is that no Unicode fonts = no Unicode support.

What I am trying to say is that Unicode should pragmatically give up the
"abstract character" concept in some limited cases, and explicitly admit
that some of its code points are not there to represent "abstract"
characters, but rather to represent "actual" glyphs.

If this distinction is made clear, then everything would fit nicely into its
proper slot: it would become clear(er) that some "characters" are actually
glyphs, designed to be used as glyph indexes inside fonts (or inside
rendering algorithms), and that applications are not encouraged to use them
to encode text.

In other words, a thing like U+00E1 ("á") should possibly not be used in a
text file or WP document: it would mainly be there to serve as an index for
font developers who choose to have a separate "á" glyph to render the
U+0061 U+0301 sequence.

This would open the door to three different things:

        1) Greater relaxation in adding new pre-composed glyphs: if font
designers ask for them, they must have their good reasons. So add them
freely, as long as it is clear that these new "characters" are to be used
internally by applications. This approach would make it painless, e.g., to
add a section with many nice Indic ligatures and half-letters to help Indian
font designers. And it would make clearer the relation between the Arabic
"abstract" characters in the U+06xx block and the Arabic "graphic"
characters in the U+FBxx-U+FDxx blocks.

        2) The possibility of standardizing, up to a certain degree, the
process of transforming a string of "abstract characters" into a string of
renderable glyphs. Of course, some details will always be totally dependent
on the font and the required quality. But at least some basic readability
features could be expressed as simple mapping lists, rather than as lengthy
algorithms described in natural language.

        3) Greater relaxation for application developers: they would still
be free to nicely display a character like U+00E1, but they would no longer
be blamed if they wanted to be extremist and show a white box instead.
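
The mapping lists imagined in point 2 could look something like this: a toy
sketch, with made-up glyph names, of a character-sequence-to-glyph-sequence
table driven by a trivial longest-match loop.

```python
# A toy character->glyph mapping list, as suggested in point 2 above.
# Glyph names are invented purely for illustration.
SHAPING_MAP = {
    ("a", "\u0301"): ["glyph.a_acute"],   # a + combining acute -> one glyph
    ("f", "i"): ["glyph.fi_ligature"],    # fi ligature
}

def shape(text: str) -> list:
    # Greedy two-character lookahead over the mapping list;
    # anything unmapped falls through to a per-character glyph.
    glyphs, i = [], 0
    while i < len(text):
        pair = tuple(text[i:i + 2])
        if pair in SHAPING_MAP:
            glyphs.extend(SHAPING_MAP[pair])
            i += 2
        else:
            glyphs.append("glyph." + text[i])
            i += 1
    return glyphs

assert shape("fin") == ["glyph.fi_ligature", "glyph.n"]
assert shape("a\u0301") == ["glyph.a_acute"]
```

A real font would need longer sequences and contextual rules, but the point
stands: a data table, not a natural-language algorithm.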

OK: I hope I have been clear, and that I have contributed a new chapter to
your "What Public Opinion Thinks Out There" files.

Regards.
        Marco Cimarosti

> -----Original Message-----
> From: Marion Gunn [SMTP:mgunn@egt.ie]
> Sent: 1999 September 23, Thursday 18.50
> To: Unicode List
> Subject: A basic question on encoding Latin characters
>
> If anyone can actually understand the question I plan to set out in
> stages below, would they please do me the great favour of pointing me to
> a URL/scholarly paper containing its answer in the fewest, simplest
> number of words, and employing the expression "industrial
> implementation" at least once.
>
> The question is about Unicode 3.0/10646 (delete as appropriate) and I'd
> be grateful if experts, if they do not know of the existence of such a
> useful URL as I have outlined, would simply respond True/False to parts
> 1-4 of the question, if they consider that enough to satisfy.
>
> 1. I have heard that it argued that there is no reason to encode in the
> UCS or in 10646, _any_ new precomposed Latin combinations as
> single-entity characters. T/F.
>
> 2. I have heard that that is because UCS/10646 is a coded character set,
> rather than a checklist of actual end-user letters (to use the
> layperson's normal understanding of the word "letter"). T/F.
>
> 3. I have heard that most end-users, present company excepted,:-) have
> neither need nor desire to know how such things are coded in the
> UCS/10646, once it can represent the (layperson's) letters needed. T/F.
>
> 4. I have heard that CEN's MES-3, as distinct from related inferior
> subdivisions, contains all the combining characters needed to satisfy
> all of the Latin needs of the layperson to whom statements 2 and 3 above
> apply. T/F.
>
> 5. I have heard that not one of these three mailing lists (Alpha,
> Unicode, 10646) have experts capable of creating such a paper/URL as
> would explain 1-3 and at the same time dovetail neatly into 4 in such a
> way as to satisfy the intelligent layperson who does not want to drown
> in too much of the technical detail, but only learn enough to be able,
> on the basis of answers 1-4, to judge for himself/herself how
> comprehensively UCS/10646/MES-3 meets Latin requirements. T/F.:-)
>
> With best wishes,
> Marion Gunn



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:53 EDT