Re: Cultural registry as international standard

From: Paul Keinanen (keinanen@sci.fi)
Date: Fri Sep 18 1998 - 02:26:12 EDT


At 14:41 17.9.1998 -0700, Rick McGowan wrote:

>Oh, sigh... I went to: http://wwwold.dkuug.dk/cultreg/ looking for the Item
>115 that Jony Rosenne mentioned. I found it. Those weird mnemonic names
>are used in the 1st column, which provides machine-readable tokens for
>interpreting the locale tables, and refers back to item 1, the repertoire map
>(which is, of course, not complete with respect to 10646)

I have also had some trouble working out the contents of some of these
tables. I wrote a simple program that converted the files to UCS-2 and, if
a line started with four hexadecimal digits, prepended the character at
that code point to the line. Using Notepad on Windows NT and switching
between the Lucida Sans Unicode font (better for European languages) and the
Bitstream Cyberbit font (for non-European characters), I got quite a good
picture of which characters are intended.
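
A minimal sketch of that kind of line filter, in Python. This is not the
program described above; the function name is illustrative, and skipping
control codes and surrogates is an added assumption so that the output stays
viewable in a text editor.

```python
import re

# A line qualifies if it starts with exactly four hexadecimal digits.
HEX4 = re.compile(r"^([0-9A-Fa-f]{4})\b")

def prefix_with_character(line: str) -> str:
    """Prepend the character named by a leading 4-digit hex code point."""
    match = HEX4.match(line)
    if match is None:
        return line
    code_point = int(match.group(1), 16)
    # Assumption: leave C0 controls and surrogates alone, since they are
    # not displayable characters.
    if code_point < 0x20 or 0xD800 <= code_point <= 0xDFFF:
        return line
    return chr(code_point) + " " + line
```

Lines that do not start with a code point pass through unchanged, so the
filter can be run over a whole table file.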

>The "REAL" names are in the last column. For example, here's an entry:
>
> <////> /x5C <U005C> REVERSE SOLIDUS
>
>That entry is certainly readable, and English.

I also wrote a simple program that converted these tables to UCS-2,
scanned for occurrences of <Uxxxx>, converted the value to a character and
used it to replace what was between the first pair of <> (in this case
<////>), which made the tables quite readable in a Unicode environment.
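
The substitution can be sketched in Python as follows. The names are
illustrative, and two things are assumptions rather than facts from the
tables: that "/" acts as an escape character inside a <...> token (suggested
by <////> and /x5C in the quoted entry), and that the angle brackets
themselves are kept around the substituted character.

```python
import re

# <Uxxxx> with exactly four hex digits, as in the quoted entry.
UCS_NAME = re.compile(r"<U([0-9A-Fa-f]{4})>")
# A <...> token; "/" is assumed to escape the following character.
FIRST_TOKEN = re.compile(r"<(?:/.|[^/>])*>")

def show_character(line: str) -> str:
    """Replace the contents of the first <...> with the <Uxxxx> character."""
    ucs = UCS_NAME.search(line)
    first = FIRST_TOKEN.search(line)
    # Leave the line alone if there is no <Uxxxx>, or if the first token
    # on the line is the <Uxxxx> itself.
    if ucs is None or first is None or first.start() >= ucs.start():
        return line
    char = chr(int(ucs.group(1), 16))
    return line[:first.start()] + "<" + char + ">" + line[first.end():]
```

Applied to the quoted entry, this turns <////> into <\> and leaves the rest
of the line as it was.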

>The weird mnemonics are not
>the ONLY mnemonics; but they're used in the actual locale tables like this:
>
> LC_TIME
> abday "<s><u><n>";"<m><a'><n>";/
> "<t><y'><s>";"<m><i><k>";/
> "<h><o'><s>";"<f><r><i'>";/
> "<l><e><y>"
>
>which is pedantic, but if you only have ASCII to express strings of
>characters from a richer set, that's the kind of tiresome act you have to
>play.

I would suggest that two copies of each standard file be maintained at the
dkuug site: one in the current plain ISO 646 format and one as UCS-2
in ISO 10646, with all those <..> constructions replaced by the corresponding
ISO 10646 code points.

For transmission efficiency, it might be a good idea to use UTF-8 instead of
plain UCS-2.
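
A quick way to see the size difference, sketched in Python. The function
name is illustrative; "utf-16-be" is used as a stand-in for UCS-2, which
gives the same bytes as long as the text contains only characters with
four-digit code points.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UCS-2 byte count, UTF-8 byte count) for the given text."""
    # For BMP-only text, UTF-16 big-endian without a BOM matches UCS-2.
    return len(text.encode("utf-16-be")), len(text.encode("utf-8"))
```

For text that is mostly ASCII, as these tables are, UTF-8 takes roughly half
the space: encoded_sizes("REVERSE SOLIDUS") gives (30, 15).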

The files would still be maintained in plain ISO 646. A script would
copy each file to the public directory and would also run a program that
makes a UCS-2 copy of the file and replaces those <...> constructions with
the corresponding code points. That way there would be no extra manual work
in updating the tables, and we could be reasonably sure that the contents of
the two files agree with each other.
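
One possible shape for such a conversion program, sketched in Python. All
names are hypothetical. It assumes that repertoire map lines look like the
entry quoted earlier (<mnemonic> /xHH <Uxxxx> NAME), that "/" escapes the
next character inside a <...> token, and that only four-digit code points
occur, so that "utf-16-be" produces the same bytes as UCS-2.

```python
import re

TOKEN = re.compile(r"<(?:/.|[^/>])*>")        # "/" assumed to be the escape
UCS_NAME = re.compile(r"<U([0-9A-Fa-f]{4})>")

def build_table(repertoire_lines):
    """Map each mnemonic token, e.g. "<a'>", to its ISO 10646 character."""
    table = {}
    for line in repertoire_lines:
        token = TOKEN.match(line.strip())
        ucs = UCS_NAME.search(line)
        if token and ucs and token.group(0) != ucs.group(0):
            table[token.group(0)] = chr(int(ucs.group(1), 16))
    return table

def expand(text, table):
    """Replace known <...> tokens and bare <Uxxxx> names with characters."""
    def replace(match):
        token = match.group(0)
        if token in table:
            return table[token]
        ucs = UCS_NAME.fullmatch(token)
        if ucs:
            return chr(int(ucs.group(1), 16))
        return token  # unknown mnemonic: keep it visible rather than guess
    return TOKEN.sub(replace, text)

def to_ucs2(text):
    """Encode as big-endian UCS-2 (valid for BMP-only text)."""
    return text.encode("utf-16-be")
```

With a table built from entries for <s>, <u>, <n>, <m> and <a'>, the text
"<s><u><n>";"<m><a'><n>" expands to "sun";"mán". Unknown mnemonics are left
as they are, so gaps in the repertoire map stay visible in the generated
copy.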

While Windows NT and Plan 9 seem to be the only operating systems that
officially support Unicode in native applications, some web browsers,
especially Netscape, have supported UTF-8 (and later also plain UCS-2) for
a while on other platforms as well (and if I am not mistaken, there is a
large project internationalizing Netscape for a large number of
platforms). So in my opinion, it is time to publish those standard tables
in UCS-2 as well, at least for human consumption (while some automatic
processing may still be better served by the plain ISO 646 version).

While maybe only 10 % of the users today would benefit from these new UCS-2
files, I do not think it will take many years until there is a 50/50
split between users of the ISO 646 and ISO 10646 versions.

Paul


