Re: Caring about European requirements sensitively!

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Oct 22 1997 - 23:16:04 EDT


Alain wrote:
>
> [Alain] :
> In practice, IBM will create a new EBCDIC code page to have the EURO. It is
> likely to be using the same code position as the one that will be
> "replaced" in Latin 0 (likely at the end the CURRENCY SYMBOL) out of Latin
> 1. It means that if this occur, mappings will still be valid. In the same
> way when Latin 0 will be standardized this new code table will also contain
> the other European characters missing as a defect in Latin 1 (which was
> supposed to support French and Finnish fully but which did not do it). IBM
> is likely to choose the same mapping positions that these characters will
> have in Latin 0.

I have never known IBM to be so cavalier in treatment of its code pages.
We are not talking about *one* new code page here. Each of the Latin-1
converged EBCDIC code pages (and there are quite a number of them besides
CP037 and CP500) would have to spawn off another distinct code page
if it is to be interconvertible with 8859-15 ("Latin-9"). 8859-15
introduces a character repertoire which is distinct from the Latin-1
repertoire ("character set" in IBM terminology). To work with the
existing EBCDIC encodings (*plural*), that repertoire would have to
be combined with new code pages to form a new family of CCSID's
converged with 8859-15 instead of 8859-1. Those new CCSID's would,
it is true, then be completely convertible with 8859-15, but they would
create legacy problems for all the existing CCSID's that are Latin-1
converged.
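
The "plural" point can be checked with a minimal sketch, assuming Python's built-in cp037 and cp500 codecs faithfully reflect the IBM code pages. The helper names differing_bytes and can_encode are made up for illustration. It shows the same characters sitting at different byte positions in the two code pages, and that neither has a position for the EURO SIGN.

```python
# Sketch: two Latin-1 converged EBCDIC code pages carry the same characters
# at different byte positions, and neither has room for the EURO SIGN.
# Assumes Python's built-in cp037/cp500 tables reflect the IBM code pages.

def differing_bytes(codec_a, codec_b):
    """Return the byte values that the two single-byte codecs decode differently."""
    return [
        value
        for value in range(256)
        if bytes([value]).decode(codec_a) != bytes([value]).decode(codec_b)
    ]

def can_encode(char, codec):
    """Return True if the codec has a code position for the character."""
    try:
        char.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

moved = differing_bytes("cp037", "cp500")
print("positions that differ between CP037 and CP500:",
      [f"0x{value:02X}" for value in moved])
print("'[' in CP037:", "[".encode("cp037").hex(),
      "| in CP500:", "[".encode("cp500").hex())
for codec in ("cp037", "cp500"):
    print(codec, "can encode EURO SIGN:", can_encode("\u20ac", codec))
```

Any Euro-capable variant therefore has to be defined once per existing code page, not once overall.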

And nothing about this is automatic. Defining a new CCSID, say CCSID
35915 (for example) that has the 8859-15 repertoire, with the Code Page
500 encoding, but with the same one-for-one character replacements
that 8859-15 makes relative to 8859-1, doesn't do anything more than
create a new CCSID. No Code Page 500 DB2 databases are going to suddenly
upgrade themselves to use CCSID 35915. And none of the remapping is going
to be transparent or problem-free.
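
As a rough illustration of why the remapping is not transparent, here is a sketch that reads the same stored bytes under ISO 8859-1 and under ISO 8859-15. The ISO pair stands in for an old/new EBCDIC CCSID pair, which is only an analogy; the sample string and the function name changed_positions are invented for the example.

```python
# Sketch: the same stored bytes read under ISO 8859-1 and under ISO 8859-15.
# The ISO pair stands in for an old/new EBCDIC CCSID pair (an analogy only).

def changed_positions():
    """Map each byte value whose meaning differs to its (Latin-1, Latin-9) characters."""
    changes = {}
    for value in range(256):
        raw = bytes([value])
        old, new = raw.decode("iso8859_1"), raw.decode("iso8859_15")
        if old != new:
            changes[value] = (old, new)
    return changes

changes = changed_positions()
for value, (old, new) in sorted(changes.items()):
    print(f"0x{value:02X}: U+{ord(old):04X} becomes U+{ord(new):04X}")

# Data written under Latin-1 and never relabeled (sample text is made up):
stored = "1\u00bd cups, price 10\u00a4".encode("iso8859_1")
print(stored.decode("iso8859_1"))    # what the author wrote
print(stored.decode("iso8859_15"))   # what a Latin-9 reader sees
```

The bytes never change; only the label does, so nothing in the data itself reveals which reading was intended.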

>
> So compatibility will be clean there.

I think it should be clear why I differ with you on that assessment.

> And in practice,
> EURO will have to be interchanged with EBCDIC, not only UNICODE (with
> UNICODE too, of course). So what is being proposed in Latin 0 is clean to
> do all this.

Saying the EURO will have to be interchanged with EBCDIC doesn't make it
happen. Do you really think that all those legacy databases are going
to simply redefine their characters? IBM is going to tell its customers,
oops, sorry, we didn't mean what our CDRA standard says when it defined
all the characters you use in your databases, and we are going to change
their meaning for you? I don't think so. The EURO will have to be introduced
to EBCDIC by introducing new code pages, and then new code pages will
have to be supported on the databases--and all of this will involve
pain in the transition. Not clean at all.

>
> Now those UNIX-8-bit systems that want to implement Latin 0 will be happy
> to be in the same bandwagon.

If they want to implement Latin-9 as yet another alternative character
set along with Latin-1, Latin-2, Latin-3, Latin-4, Latin-5, whatever, then fine.
But if they really think they can painlessly replace Latin-1 with
Latin-9 by swapping in a few new characters for ones that nobody needed anyway,
then I suspect they will turn out not so happy after all.

In my opinion, Latin-1 has been amazingly successful, and has become
effectively the European "ASCII". Yes, it has defects, just as 7-bit
ASCII has always had defects even for accentless representation of English
data. But the campaign to deal with the Euro problem by sweeping away
Latin-1 (and by the way then adding in some French and Finnish characters)
has the potential to wreak IT havoc.

The real consequence of continuing the campaign to replace Latin-1 with
"Latin-0" will be to destabilize the interpretation of Latin-1 data,
and result in a de facto situation where everyone depends instead on
Windows 1252 to get it right, at the expense of ISO-compliant 8859-x
based Unix systems. (Is Latin-0 actually just another nefarious
plot by Bill Gates, I wonder? ;-) )
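
A small sketch of why 1252 ends up as the safe reading, assuming Python's current cp1252 table (which may not match what shipped at any given date). The helper same_upper_half is a made-up name. It checks that 1252 leaves every Latin-1 position in 0xA0-0xFF alone and places the new characters in the 0x80-0x9F range instead.

```python
# Sketch: Windows code page 1252 versus ISO 8859-1 and ISO 8859-15,
# using Python's current codec tables (an assumption about what ships).

# EURO SIGN plus the seven letters that 8859-15 adds relative to 8859-1.
NEW_IN_LATIN9 = "\u20ac\u0160\u0161\u017d\u017e\u0152\u0153\u0178"

def same_upper_half(codec_a, codec_b):
    """True if both codecs decode every byte in 0xA0-0xFF to the same character."""
    return all(
        bytes([value]).decode(codec_a) == bytes([value]).decode(codec_b)
        for value in range(0xA0, 0x100)
    )

print("1252 agrees with Latin-1 on 0xA0-0xFF:",
      same_upper_half("cp1252", "iso8859_1"))
print("Latin-9 agrees with Latin-1 on 0xA0-0xFF:",
      same_upper_half("iso8859_15", "iso8859_1"))
for char in NEW_IN_LATIN9:
    print(f"U+{ord(char):04X} in 1252 at 0x{char.encode('cp1252')[0]:02X}")
```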

>
> As for the additional few characters in 1252 that are left over (not many!)
> in C1 control space, before we go to system-wide implementation of UNICODE,
> the situation won't be worse than it is today. There exists so far no
> requirement to exchange these characters with EBCDIC data, while there is a
> European requirement to exchange the EURO SIGN, 3 French characters and 4
> Finnish characters more than in Latin 1. When one wants to talk about
> practical things, one has to talk practically. That's what the Latin 0
> proposers have in mind, only practical considerations for the real world of
> today and the 5 coming years at least.
>
> All the destroyers of Latin 0 just have *un*solutions to propose to the
> requirements. They only want a quick fix that is not even a fix to the
> problems exposed and they do not even want to see the problems and try to
> solve them really. They do not care much really about actual European
> problems, should I say if I did not know that they also have good
> intentions in mind, of course.

I see two valid non-*un*solutions:

  1. Proceed with 8859-15 (Latin-9) as another part of 8859 (and
     add more parts of 8859 to create corresponding 8-bit standards
     that add the EURO SIGN to the Greek part, the Eastern European part,
     the Turkish part, the Baltic part, ...). Deal honestly with the
     data convertibility problems between the new parts and the
     established set of 8859 parts which do not contain the EURO
     (or the French or Finnish characters), and expect to have a
     fairly long transition period of moving from EURO-less 8859
     parts to EURO-ful 8859 parts. A painful transition, but
     well-defined, stepwise, and not plagued with the potential
     problem of catastrophic loss of interpretability of Latin-1
     data. But this non-*un*solution doesn't have a clear path
     to the future. It is the short-term hack to solve the
     immediate problem of the EURO, but doesn't deal with the next new
     character that everyone in Europe decides that they must
     have in their IT systems.

  2. Move the 8-bit systems to 8-bit+ systems, using UTF-8, with
     a drastically constrained repertoire. (Just union the European
     parts of 8859, if you like, and add U+20AC EURO SIGN.) You
     get a small repertoire compatible with European immediate
     needs, without any of the complications of the full 10646
     repertoire that everyone is so afraid of. The fonts will be
     easy to create--just piece them together from the chunks already
     required for the individual 8859-x encodings. Treat everything
     else (except for the actual encoding values) just as you have
     been for 8859-1. This is an incremental solution that has
     a future path. Future additions to the repertoire are simple
     extensions to the interpreted repertoire, without introducing
     coding changes or redefining characters. Fully Unicode-compliant
     platforms such as Windows NT, Windows CE, or AIX, and systems that
     already support UTF-8 such as Solaris, could interwork with this
     trivially. Most systems that can handle Asian data could adopt a
     stripped-down UTF-8 repertoire like this in short order for Europe.
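
Option 2 can be sketched in a few lines, under stated assumptions: the list of 8859 parts below is chosen purely for illustration (which parts count as "European", and whether part 15 belongs in the union, is a judgment call), and the names EUROPEAN_PARTS, build_repertoire and outside_repertoire are hypothetical. The sketch unions the parts, adds U+20AC, and treats UTF-8 as the only encoding.

```python
# Sketch: a constrained repertoire built from 8859 parts, carried in UTF-8.
# The list of parts is an assumption chosen for illustration.
EUROPEAN_PARTS = [
    "iso8859_1", "iso8859_2", "iso8859_3", "iso8859_4", "iso8859_5",
    "iso8859_7", "iso8859_9", "iso8859_10", "iso8859_15",
]

def build_repertoire(codecs_to_union):
    """Union the characters of each 8-bit codec, then add U+20AC EURO SIGN."""
    repertoire = {"\u20ac"}
    for name in codecs_to_union:
        # errors="ignore" skips byte values that a part leaves unassigned.
        repertoire.update(bytes(range(256)).decode(name, errors="ignore"))
    return repertoire

REPERTOIRE = build_repertoire(EUROPEAN_PARTS)

def outside_repertoire(text):
    """Return the characters of text that the constrained repertoire excludes."""
    return sorted(set(text) - REPERTOIRE)

# French, Finnish-style, Greek and Euro text in one string, one encoding.
sample = "c\u0153ur, \u017e, \u0395\u03bb\u03bb\u03ac\u03b4\u03b1, 100 \u20ac"
print(len(REPERTOIRE), "characters in the repertoire")
print(sample.encode("utf-8"))
print("outside repertoire:", outside_repertoire(sample))
print("outside repertoire:", outside_repertoire("\u4e2d"))
```

Extending the repertoire later means adding to the set; no stored byte changes meaning.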

Alain, I am afraid that you have been campaigning so long to eliminate
7-bit constraints in favor of 8-bit clean data, that attaining the
8-bit goal seems a valid place to stop for you. But the reality is that
an 8-bit character is not big enough to serve Europe's needs.
The effort to elbow some characters out of one 8859 encoding in
favor of a few different ones does not solve Europe's clear IT needs
(which include Greece, the Czech Republic, Russia, ... --not just
France and Finland).

Try thinking of 8-bit characters as Unicode characters that someone
unfortunately stripped the top 8 bits from, thereby trashing the
œ's in French and ž's in Finnish. Then we can all get on the
bandwagon to eliminate this odious practice of stripping the high
8 bits and get people to focus on implementing UTF-8 or UTF-16
correctly.
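
The thought experiment can be made literal with a toy sketch. It assumes 16-bit character values and simply masks off the high byte; the function name strip_high_byte and the sample words are made up for the example.

```python
# Sketch: what "stripping the top 8 bits" does to 16-bit character values.

def strip_high_byte(text):
    """Keep only the low 8 bits of each code point, as a narrow 8-bit system would."""
    return "".join(chr(ord(char) & 0xFF) for char in text)

# U+0153 (oe ligature), U+017E (z with caron), U+20AC (EURO SIGN),
# and a Latin-1 word that survives because its high byte is already zero.
for word in ("c\u0153ur", "\u017e", "\u20ac", "caf\u00e9"):
    print(repr(word), "->", repr(strip_high_byte(word)))
```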

>
> They do not respond to European requirements, whatever their goal is. In
> doing so they are also threatening IBM and its huge installed base of
> mainframes in Europe, I don't know if that is well realized.
>
> Or, what they say that should be done for EBCDIC will generate eternal
> conversion costs (data losses and round-trip integrity violation) for which
> they will be blamed for decades, if I might express it simply. Fortunately
> common sense will prevail and Latin 0 will be standardized.

Hopefully common sense will prevail and 8859-15 (Latin-9) will be seen
as another bump in the long 8859 road towards Unicode/10646
acceptance as the European solution for data representation.

--Ken

>
> Alain LaBonté
> Cornwall (Ontario)


