Re: Diaeresis vs. umlaut (was: Re: Standaridized variation sequences for the Desert alphabet?) from Philippe Verdy on 2017-03-24 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Fri, 24 Mar 2017 20:34:53 +0100

Given the history of characters and the initial desire to stay compatible
with previous ISO standards, I am convinced that there was no other choice
than preserving the unification. Otherwise it would have been impossible
to reliably remap the zillions of documents, databases and applications
that were using ISO 8859 and the related Windows, MacOS and IBM code pages
(for OEMs or for EBCDIC). Add to this the development of the Internet and
the desire in both Unicode and ISO 10646 to keep the first page of code
points in the UCS fully compatible, code for code, with ISO 8859-1. No
variant of ISO 8859-1 was ever standardized for Germany, Switzerland,
Austria, Belgium and Luxembourg, which did not request one (causing
nightmares notably in the last three countries), and a lot of legacy
software on Windows and MacOS needed such a bijective mapping. Finally,
Unicode was initially developed separately from the ISO standard and the
two were merged later; at that time, Microsoft and IBM were the most
active members and did not want to introduce incompatibilities or cause
trouble for other vendors.
Later there was a clear commitment to keep the basic character properties
stable, and it became impossible to change the canonical equivalences
(after the bad experience of merging the Unicode and ISO efforts, notably
for encoding Hangul, and strong initial resistance from China, which
wanted to develop its own GB standard).
Encoding stability is now a rule that will be extremely hard to break.

Note: umlauts and diaereses have not always looked the same; the confusion
between the two only started around the middle of the 20th century, with
the early development of computing. It would have been impossible to reach
a large adoption of the UCS without such compromises. It took additional
years after both projects joined their efforts before ISO finally closed
its working group on legacy 8-bit character sets and stopped accepting
new variants. ISO 8859-15 was one of the last attempts to standardize a
new 8-bit encoding, and it failed: almost nobody really used it, because
nobody needed it any longer. China relented as well and finalized the
roundtrip mapping between its competing GB 18030 encoding and the UCS, so
the mappings for GB 18030 no longer need updates: any character newly
encoded in the UCS is immediately encoded in GB 18030 as well, without
modifying any line of code or data, and any software or document
compatible with the UCS should be immediately compatible with the GB 18030
standard required in PR China. I don't know if the Hong Kong authorities
made the same statement for their HKSCS standard before Hong Kong
reunified with China, or if Taiwan made a similar decision. Japan,
however, is still adding new encodings to its JIS standards, pushed by
national vendors; the UCS is slow to accept these additions and not
everything is accepted, but in this area there is a local subcommittee
constantly negotiating with Asian vendors and reporting its efforts to
Unicode and ISO.
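
To illustrate the roundtrip claim for GB 18030, here is a minimal Python
sketch. It assumes the "gb18030" codec shipped with CPython; that codec
reflects one edition of the standard, so this is only an illustration and
not a conformance test. The helper name roundtrips() and the sample
characters are mine, chosen for the example.

```python
def roundtrips(text: str) -> bool:
    """Return True if text survives UCS -> GB 18030 -> UCS unchanged."""
    encoded = text.encode("gb18030")
    return encoded.decode("gb18030") == text


# Hypothetical samples: a Latin-1 letter, a CJK ideograph, and a
# supplementary-plane character (U+10400, a Deseret letter).
samples = ["\u00FC", "\u4E2D", "\U00010400"]
results = {f"U+{ord(ch):04X}": roundtrips(ch) for ch in samples}

for code_point, ok in results.items():
    print(code_point, "roundtrips" if ok else "does NOT roundtrip")
```

The supplementary-plane sample is there because it was assigned well after
the first GB 18030 edition; it needs no table update on the GB side, since
that range is mapped algorithmically.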

About umlauts and diaereses: I'm not sure they always looked the same.
If we try to encode old German, Hungarian or Czech texts, we may find some
discrepancies or ambiguities, and there is still no mechanism to
distinguish an intended umlaut from an intended diaeresis when they do not
look the same in historic script variants. We cannot encode these using
"variants", but we could possibly use a combining control such as CGJ
(encoded after the precomposed letter or after the base letter +
diaeresis; because of canonical equivalences it cannot go in the middle).
Or maybe, only for historic texts, we could add a combining lowercase e
as an alternative to the existing diaeresis.
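
A small Python sketch of the CGJ placement constraint, using only the
standard unicodedata module. Using CGJ as an umlaut/diaeresis marker is
just the idea floated above, not an established convention, and the
variable names are mine.

```python
import unicodedata

CGJ = "\u034F"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

precomposed = "\u00FC"        # "ü" as a single code point
decomposed = "u" + DIAERESIS  # base letter + combining diaeresis

# Both spellings are canonically equivalent: same NFD form.
same = unicodedata.normalize("NFD", precomposed) == decomposed

# CGJ placed after the letter survives normalization in either form.
marked_nfc = unicodedata.normalize("NFC", decomposed + CGJ)
marked_nfd = unicodedata.normalize("NFD", precomposed + CGJ)

# CGJ placed in the middle separates the base from the diaeresis:
# NFC leaves the sequence alone, so it is no longer equivalent to "ü".
middle = "u" + CGJ + DIAERESIS
middle_nfc = unicodedata.normalize("NFC", middle)

print(same)                       # canonical equivalence of the two spellings
print(marked_nfc == precomposed + CGJ)
print(marked_nfd == decomposed + CGJ)
print(middle_nfc == middle)       # unchanged, not composed into "ü"
```

In other words, a trailing CGJ is stable under both NFC and NFD, while a
CGJ between the base letter and the diaeresis produces a different text
that no normalization will map back to the precomposed letter.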

2017-03-24 19:33 GMT+01:00 Doug Ewell <doug_at_ewellic.org>:

> Philippe Verdy wrote:
>
> > But Unicode just preferred to keep the roundtrip compatibility with
> > earlier 8-bit encodings (including existing ISO 8859 and DIN
> > standards) so that "ü" in German and French also has the same
> > canonical decomposition even if the diacritic is a diaeresis in French
> > and an umlaut in German, with different semantics and origins.
>
> Was this only about compatibility, or perhaps also that the two signs
> look identical and that disunifying them would have caused endless
> confusion and misuse among users?
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>
Received on Fri Mar 24 2017 - 14:35:40 CDT
