Re: A basic question on encoding Latin characters

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Sep 23 1999 - 17:14:16 EDT


Well, the panel of experts (Markus, Michael, and Rick) has answered,
and almost agreed. Herewith my contribution.

>
> If anyone can actually understand the question I plan to set out in
> stages below, would they please do me the great favour of pointing me to
> a URL/scholarly paper containing its answer in the fewest, simplest
> number of words, and employing the expression “industrial
> implementation” at least once.

As Rick pointed out, the issue you are getting at is somewhat
indirectly hinted at. If the issue is, as many have wondered, whether
the Unicode Standard really is "complete" for Latin, despite the
non-appearance in its charts of <name your favorite accented Latin
letter for Foovian here>, then the answer has been stated in the
Unicode Standard since version 1.0 (p. 10, 1991):

   "The Unicode standard allows dynamic composition of accented forms..."

We have been saying this for 10 years now (8 in print)--it is just
that until recently many people simply refused to believe
that software could actually work that way.
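
A minimal sketch of what dynamic composition looks like in practice,
assuming Python and its standard unicodedata module (neither is named in
the standard's text). The q-with-acute letter is a made-up "Foovian"
example, chosen only because it has no precomposed code point:

```python
import unicodedata

# A base letter followed by a combining mark: two code points, one accented letter.
e_acute = "e\u0301"   # e + COMBINING ACUTE ACCENT
q_acute = "q\u0301"   # q + COMBINING ACUTE ACCENT (hypothetical "Foovian" letter)

def compose(text):
    """Return the NFC form: combining sequences are replaced by
    precomposed characters wherever such characters exist."""
    return unicodedata.normalize("NFC", text)

composed_e = compose(e_acute)   # a precomposed e-acute exists, so this is one code point
composed_q = compose(q_acute)   # no precomposed q-acute exists, so the sequence is kept

print([unicodedata.name(c) for c in composed_e])
print([unicodedata.name(c) for c in composed_q])
```

Both strings are valid Unicode text for an accented letter; only the first
happens to have a single-code-point equivalent in the charts.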

Now with Office 2000 (an "industrial implementation", if there ever
was one), new versions of the Mac OS, and other software coming out,
we are faced with the situation that soon most mainstream software
*will* work that way (although the Unixes are still catching up). It
will get harder and harder for the doubting Thomases to keep claiming
that people will never build machines that can fly. Don't expect them
all to be transcontinental liners the first year right out of the
bicycle shop, but you may be surprised at how quickly people will
get used to it all just working right -- just as they now board
jet airplanes without thinking too much about how 350 tons of
machinery consisting of half a million parts manages to get off
the ground, climb 6 miles high, and move at 620 miles per hour --
reliably.

>
> The question is about Unicode 3.0/10646 (delete as appropriate) and I’d
> be grateful if experts, if they do not know of the existence of such a
> useful URL as I have outlined, would simply respond True/False to parts
> 1-4 of the question, if they consider that enough to satisfy.
>
> 1. I have heard it argued that there is no reason to encode in the
> UCS or in 10646, _any_ new precomposed Latin combinations as
> single-entity characters. T/F.

As for all good questions, the answer is T *and* F. People do have
good reasons to encode more precomposed Latin letters. Completion of
partial alphabetic sets, ease of use with simpler rendering systems,
and one-to-one conversion simplicity against data that may be coded
that way in an 8-bit set are several examples.

The problem is that none of those reasons are good *enough*. And with
the introduction of Unicode 3.0 and the way it is tied to normalization
forms that soon will become ubiquitous on the Internet, the associated
costs for encoding new precomposed characters have risen steeply,
and the associated benefits have been lessened (since they are going
to end up decomposed in the normalization form seen on the Internet
anyway).
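
A rough illustration of that point, again assuming Python's unicodedata
module. The paragraph above does not say which normalization form is
meant, so the sketch shows both: under NFD every precomposed letter is
taken apart, and under NFC a precomposed character on the composition
exclusion list still comes out decomposed. U+2ADC FORKING is used as the
excluded example; it is not a Latin letter, but it is a real character
that was encoded after the composition rules were frozen:

```python
import unicodedata

def code_points(text):
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in text]

# A long-established precomposed Latin letter:
# LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE.
precomposed = "\u1e69"
decomposed = unicodedata.normalize("NFD", precomposed)   # s + dot below + dot above
recomposed = unicodedata.normalize("NFC", decomposed)    # back to the single code point

# A precomposed character on the composition exclusion list:
# NFC leaves it as base character + combining overlay.
forking = "\u2adc"
forking_nfc = unicodedata.normalize("NFC", forking)

print(code_points(decomposed))
print(code_points(recomposed))
print(code_points(forking_nfc))
```

A newly encoded precomposed Latin letter would be in the position of the
second example: present in the charts, but absent from normalized data.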

Thus with the shiny new millennium being ushered in with brand spanking
new editions of the Unicode Standard *and* of ISO/IEC 10646-1, neatly
in synch, the balance in the argument for more precomposed Latin
combinations has shifted strongly against encoding them as units.

>
> 2. I have heard that that is because UCS/10646 is a coded character set,
> rather than a checklist of actual end-user letters (to use the
> layperson’s normal understanding of the word “letter”). T/F.

Also True and False, which is why you got both answers. It is true
that the Unicode Standard is not intended as an inventory,
registration, or "checklist" of end-user letters (what we call
"graphemes" -- the explicit units of orthographies). It is an
encoding of "characters", which are abstractions. Think of it as
an Erector Set (remember those? -- if not, then think Lego Set)
of pieces needed to build what the user sees as their "letters".
In some cases you bolt the frazmus onto the whatzit before you
hand the whole thing over to the user for their whatchamacallit.
Sometimes the whatchamacallit is just a single part in the set.
In any case, they get what they want in the end.
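
To make the parts-versus-letters distinction concrete, here is a small
sketch in Python. The helper name user_letters is hypothetical, and the
grouping rule (attach any character with a nonzero combining class to the
preceding base) is a deliberate simplification that is adequate for Latin
accents, not a full text-segmentation algorithm:

```python
import unicodedata

def user_letters(text):
    """Group each base character with the combining marks that follow it.

    Simplified: only characters with a nonzero canonical combining class
    are treated as marks. Good enough for Latin accents; not a general
    grapheme segmenter.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # bolt the accent onto the previous base
        else:
            clusters.append(ch)     # start a new user-visible letter
    return clusters

word = "re\u0301sume\u0301"         # "resume" with two combining acute accents
letters = user_letters(word)

print(len(word))      # number of encoded characters (the parts)
print(len(letters))   # number of letters the user sees
```

The encoded string holds eight parts, while the reader counts six letters.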

However, as Michael pointed out, the biggest *reason* why no more
precomposed Latin characters should be encoded is not the distinction
between a character encoding and a registration list of graphemes.
If that were the case, there never would have been any reason to
put precomposed Latin accented characters into the Unicode Standard
in the first place. No, the main reason for "no more hereafter" is
the impact of the normalization algorithm on Unicode data.

>
> 3. I have heard that most end-users, present company excepted,:-) have
> neither need nor desire to know how such things are coded in the
> UCS/10646, once it can represent the (layperson’s) letters needed. T/F.

Well, this one, at least, I'd have to say is pretty much
unambiguously true. *Most* end-users neither need nor desire
to know how *anything* in their computers works, any more than
they desire to know how their carburetors and ignition systems
work in their cars.

>
> 4. I have heard that CEN’s MES-3, as distinct from related inferior
> subdivisions, contains all the combining characters needed to satisfy
> all of the Latin needs of the layperson to whom statements 2 and 3 above
> apply. T/F.

True.

>
> 5. I have heard that not one of these three mailing lists (Alpha,
> Unicode, 10646) has experts capable of creating such a paper/URL as
> would explain 1-3 and at the same time dovetail neatly into 4 in such a
> way as to satisfy the intelligent layperson who does not want to drown
> in too much of the technical detail, but only learn enough to be able,
> on the basis of answers 1-4, to judge for himself/herself how
> comprehensively UCS/10646/MES-3 meets Latin requirements. T/F.:-)

False. But of course this depends on what it takes to "satisfy" that
hypothetical intelligent layperson. That hypothetical layperson is
already pretty clued-in if they have even heard of "MES-3" -- note
the "I dunno" responses from some of the people on this list, and
they are experts on character encoding issues.

And you also need to consider that the person who actually needs to
be satisfied is usually a procurement officer. They won't be satisfied
by a chatty little paper explaining it. They are more likely to be
satisfied if the standards organizations, major players in the
IT industry, and the analysts whose newsletters they subscribe to
declare it to be so. It is far easier to accede to a unanimous
assertion of the truth than to bother understanding all the messy
details behind the assertion.

--Ken

>
> With best wishes,
> Marion Gunn
>


