Re: Unicode 6.2 to Support the Turkish Lira Sign

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Thu, 31 May 2012 08:29:39 +0200

2012/5/30 "Martin J. Dürst" <duerst_at_it.aoyama.ac.jp>:
> On 2012/05/30 4:42, Roozbeh Pournader wrote:
>
>> Just look what happened when the Japanese did their own font/character set
>> hack. The backslash/yen problem is still with us, to this day...
>
>
> To be fair, the Japanese Yen at 0x5C was there long before Unicode, in the
> Japanese version of ISO 646. That it has remained as a font hack is very
> unfortunate, but for that, not only the Japanese, but also major
> international vendors are to blame.

As long as it was part of the Japanese version of ISO 646 (which
itself was only the first page of the SJIS encoding), there was
absolutely NO problem at all. This was not different from the
situation of all other national versions of ISO 646, which were all
distinct encodings.

The situation became a problem when the Japanese ISO 646 started to be
mapped to Unicode/ISO/IEC 10646 within fonts using incorrect mappings.
This occurred in the early stages of ISO/IEC 10646 development.

And unfortunately several OSes for Japan used those incorrect
mappings, assuming that it was still safe to blindly convert text
containing backslashes by showing yen symbols instead, just as the
same systems blindly converted US-ASCII (the American version of ISO
646) into SJIS with broken algorithms, simply because that software
could not really work with Unicode but still worked only with SJIS,
and did not correctly track which source encoding was used.

This would probably not have occurred if Japan had defined and
standardized an ISO 8859 version mapping the yen sign outside the
ASCII range (along with basic kana letters and Asian punctuation);
but they preferred to develop only SJIS to support kanji (and later
the emerging UCS remapped on it). Such a version would also have
offered an easier migration.

They were ambitious at the beginning, but the ambition was premature
when the surrounding technologies to support a large character set
were still very incomplete (forcing a lot of software to use
unsafe/lossy remappings to smaller character sets). So for several
decades there have been a lot of interoperability problems caused by the
various implementations of SJIS, many of them not compatible with each
other in their limitations or in the way the "simplifications" were
applied to support different parts of it.

The backslash character, though common in many programming languages
and OSes, then appeared there as the yen symbol, and people got used
to it (for example when typing pathnames in DOS/Windows filesystems,
or when using the yen symbol as the escape prefix when programming in
C/C++). The backslash was then perceived as a variant form (of their
yen symbol) that they did not need (SJIS was later adapted to map the
backslash somewhere else, but the SJIS users did not immediately fix
it).

As a result, the mapping of 0x5C in SJIS has always been ambiguous,
depending on the implementations, but it has never been ambiguous in
the Japanese version of ISO 646, which did not include the backslash.
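
To make the 0x5C difference concrete, here is a minimal Python sketch.
The table and the function name decode_646 are only illustrative (they
do not come from any library). The sketch assumes nothing beyond the
fact stated above, that the Japanese version of ISO 646 has the yen
sign at 0x5C where US-ASCII has the backslash, and it ignores every
other position where national versions differ.

```python
# Illustrative table only: just the 0x5C position is modelled here.
ISO646_OVERRIDES = {
    "us": {},               # US-ASCII: 0x5C is U+005C REVERSE SOLIDUS
    "jp": {0x5C: 0x00A5},   # Japanese version: 0x5C is U+00A5 YEN SIGN
}


def decode_646(data: bytes, variant: str) -> str:
    """Decode 7-bit data according to an explicitly named ISO 646 version."""
    overrides = ISO646_OVERRIDES[variant]  # KeyError if the version is unknown
    chars = []
    for b in data:
        if b > 0x7F:
            raise ValueError(f"not a 7-bit byte: {b:#04x}")
        # Use the national override if there is one, else the ASCII meaning.
        chars.append(chr(overrides.get(b, b)))
    return "".join(chars)


path = b"C:\\DOS"
print(decode_646(path, "us"))  # decoded with the backslash version
print(decode_646(path, "jp"))  # same bytes decoded with the yen version
```

The same bytes give two different texts, and neither decoding is wrong
as long as the version is named. The ambiguity only appears when the
caller has to guess the version.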

So don't criticize ISO 646; there was no problem there. The problem
lies entirely within the early versions of SJIS, which allowed such
variation of glyphs when they should have treated the yen symbol and
the backslash as distinct abstract characters requiring separate mappings.

But who uses the Japanese version of ISO 646 now in Japan? Only SJIS
seems to survive now, with all its intrinsic ambiguities and its many
incompatible implementations (whose exact versions are most often not
identified correctly in most software).

The Japanese NB should have stopped this nightmare by setting a rule
that strongly deprecates all other versions (and withdraws all past
recommendations), so that only one version of SJIS survives and old
data encoded with an ambiguous SJIS version is left in its black box:

It would have been simpler and more effective for the Japanese NB to
rename the SJIS standard for the only remaining version, for example
to "UJIS" ("U" for "Universal", meaning that it has full round-trip
compatibility with the UCS and no longer allows any ambiguity), and
then to freeze it completely in this state (all other developments
being made in the UCS), with a strong recommendation NOT to perform
any blind conversion to UJIS, or interpretation as UJIS, of any past
data encoded in an unversioned SJIS: all ambiguous characters in such
old data should be detected as ambiguous, meaning that the
document/data is not convertible without proper versioning.
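
The detection step could look roughly like the following Python
sketch. It is an assumption-laden illustration, not an existing tool:
the function names are made up, only 0x5C is treated as ambiguous, and
the lead-byte ranges are the commonly used Shift_JIS ones (0x81-0x9F
and 0xE0-0xFC, the upper part covering vendor extensions). The point
it shows is that 0x5C must be flagged only when it stands alone,
because the same byte value is also a legal second byte of a
double-byte character.

```python
from typing import List

# Single-byte values whose meaning depends on the SJIS version (assumed set).
AMBIGUOUS_SINGLE_BYTES = {0x5C}


def is_lead_byte(b: int) -> bool:
    """True if b starts a double-byte character (assumed Shift_JIS ranges)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def ambiguous_offsets(data: bytes) -> List[int]:
    """Offsets of single-byte characters that cannot be converted blindly."""
    offsets = []
    i = 0
    while i < len(data):
        b = data[i]
        if is_lead_byte(b):
            i += 2  # skip the trail byte, which may itself be 0x5C
            continue
        if b in AMBIGUOUS_SINGLE_BYTES:
            offsets.append(i)
        i += 1
    return offsets


def needs_versioning(data: bytes) -> bool:
    """True if the data must not be converted without a known SJIS version."""
    return bool(ambiguous_offsets(data))


print(ambiguous_offsets(b"C:\\dir"))    # a lone 0x5C in a pathname
print(ambiguous_offsets(b"\x95\x5c"))   # 0x5C as trail byte of one character
```

A converter built on this would refuse (or ask for a version) whenever
needs_versioning() returns True, instead of silently picking the
backslash or the yen sign.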

This would also have forced the various private software makers and
manufacturers that had used their own version of SJIS to register
with the Japanese NB a SINGLE (and unique) string recommended to
identify their implementation of SJIS, removing all past known aliases
that were also ambiguous with each other, so that the effective
encoding of old data could be uniquely identified and would then
become uniquely convertible, first to the national standard UJIS,
then to the UCS through its guaranteed round-trip compatibility.
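
As a rough sketch of what such a registry would buy, the Python
fragment below resolves the single byte 0x5C from a registered label.
Everything in it is hypothetical: the labels "x-vendor-a-sjis" and
"x-vendor-b-sjis", the list of rejected aliases, and the function name
are invented for illustration, and no such registry exists.

```python
# Hypothetical registry: one unique label per SJIS implementation, each
# stating which UCS character its single byte 0x5C stands for.
REGISTRY = {
    "x-vendor-a-sjis": 0x00A5,  # this implementation means YEN SIGN
    "x-vendor-b-sjis": 0x005C,  # this one means REVERSE SOLIDUS
}

# Old unversioned aliases that do not identify one implementation (assumed).
REJECTED_ALIASES = {"sjis", "shift_jis", "shift-jis"}


def resolve_0x5c(label: str) -> str:
    """Return the UCS character for byte 0x5C under a registered label."""
    key = label.lower()
    if key in REJECTED_ALIASES:
        raise LookupError(f"{label!r} is an ambiguous alias; version required")
    if key not in REGISTRY:
        raise LookupError(f"{label!r} is not a registered implementation")
    return chr(REGISTRY[key])


print(resolve_0x5c("x-vendor-a-sjis"))
print(resolve_0x5c("x-vendor-b-sjis"))
```

With unique labels the lookup is deterministic, and data carrying only
an old alias is rejected rather than guessed at.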
Received on Thu May 31 2012 - 01:34:50 CDT
