Re: A basic question on encoding Latin characters

From: Michael Everson (everson@indigo.ie)
Date: Thu Sep 23 1999 - 13:47:34 EDT


At 09:49 -0700 1999-09-23, Marion Gunn wrote:
>If anyone can actually understand the question I plan to set out in
>stages below, would they please do me the great favour of pointing me to
>a URL/scholarly paper containing its answer in the fewest, simplest
>number of words, and employing the expression "industrial
>implementation" at least once.

You would have to put the question to me first! :-) You should start with
Unicode Technical Report #15, Unicode Normalization Forms, at
http://www.unicode.org/unicode/reports/tr15/tr15-17.html. It is a 20-page
technical description of normalization by Mark Davis and Martin Dürst.

>1. I have heard it argued that there is no reason to encode in the
>UCS or in 10646, _any_ new precomposed Latin combinations as
>single-entity characters.

True. The chief reason for doing so in the past was compatibility with
existing standards. The chief reason for wanting to do so now is
font-related.

>2. I have heard that that is because UCS/10646 is a coded character set,
>rather than a checklist of actual end-user letters (to use the
>layperson's normal understanding of the word "letter").

False. It's because the Internet is now beginning to implement
normalization, and the normalization tables can't be updated usefully
without great cost to implementors in future. Every time you add a new
precomposed character you break the algorithms.

>3. I have heard that most end-users, present company excepted,:-) have
>neither need nor desire to know how such things are coded in the
>UCS/10646, once it can represent the (layperson's) letters needed.

True. The underlying encoding shouldn't be of concern. The normalization
algorithms will enable searching for <á> and <a´> just fine. The end user
is really only concerned with inputting, displaying, processing and
printing. Not encoding. The font issue is still painful (for small vendors
at least) but one remains hopeful.
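
As a rough illustration of that searching point, here is a minimal sketch in
Python using the standard unicodedata module. It assumes that <a´> above means
"a" followed by U+0301 COMBINING ACUTE ACCENT; the helper names same_text and
find_normalized are made up for this example and are not part of any library.

```python
import unicodedata

precomposed = "\u00E1"   # á as one code point
decomposed = "a\u0301"   # a + COMBINING ACUTE ACCENT (assumed reading of <a´>)

def same_text(x, y, form="NFC"):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, x) == unicodedata.normalize(form, y)

def find_normalized(haystack, needle, form="NFC"):
    """Search after normalizing both sides.

    The returned index refers to the normalized haystack, not the original.
    """
    return unicodedata.normalize(form, haystack).find(
        unicodedata.normalize(form, needle)
    )

# The raw strings differ code point by code point...
raw_equal = precomposed == decomposed
# ...but compare equal once both are normalized.
normalized_equal = same_text(precomposed, decomposed)
# A search for the precomposed letter finds the decomposed spelling.
position = find_normalized("sla\u0301inte", precomposed)
```

A real search tool would probably normalize its index once rather than on every
query; the sketch only shows why the end user need not care which spelling was
stored.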

>4. I have heard that CEN's MES-3, as distinct from related inferior
>subdivisions, contains all the combining characters needed to satisfy
>all of the Latin needs of the layperson to whom statements 2 and 3 above
>apply. T/F.

True. The (as yet unapproved) MES-3 contains all the letters belonging to
the Latin script, plus all the combining characters. So if you needed a
LATIN CAPITAL LETTER THORN WITH DIAERESIS ABOVE AND COMMA BELOW, you could
compose one (U+00DE U+0308 U+0326). If you wanted LATIN SMALL LETTER E WITH
MOOSE-ANTLERS ABOVE, you could not, because there is (as yet) no COMBINING
MOOSE-ANTLERS ABOVE encoded in the UCS.
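
To make the THORN example concrete, the following sketch (Python, standard
unicodedata module) builds the sequence given above and normalizes it. The
describe helper is hypothetical, written only for this illustration. Since no
precomposed form exists, normalization should leave three code points, though
canonical ordering is expected to put the below-mark ahead of the above-mark.

```python
import unicodedata

# THORN + COMBINING DIAERESIS + COMBINING COMMA BELOW, as in the text above
seq = "\u00DE\u0308\u0326"

def describe(text):
    """List each code point with its Unicode character name."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text]

nfc = unicodedata.normalize("NFC", seq)

# Combining classes drive canonical reordering: 230 = above, 220 = below.
classes = [unicodedata.combining(c) for c in seq]

for line in describe(nfc):
    print(line)
```

Both orders of the two marks are canonically equivalent, so a normalizing
comparison treats them as the same text.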

>5. I have heard that not one of these three mailing lists (Alpha,
>Unicode, 10646) has experts capable of creating such a paper/URL as
>would explain 1-3 and at the same time dovetail neatly into 4 in such a
>way as to satisfy the intelligent layperson who does not want to drown
>in too much of the technical detail, but only learn enough to be able,
>on the basis of answers 1-4, to judge for himself/herself how
>comprehensively UCS/10646/MES-3 meets Latin requirements.

False. You shouldn't listen to hearsay. :-)

--
Michael Everson * Everson Gunn Teoranta * http://www.indigo.ie/egt
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Guthán: +353 1 478 2597 ** Facsa: +353 1 478 2597 (by arrangement)
27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire


