From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Tue May 08 2007 - 00:51:46 CDT
These long messages with interspersed quotes make for colorful reading,
but obscure the flow of the argument. I'll try to lay it out one more
time, in order, but with an eye towards answering some of the recent points:
In German orthography (especially after the recent reform), there is a
clear distinction between ß and "ss" in lower case. There are some word
pairs where it's the only distinction. (The same is true for some
personal names).
For purposes of searching personal names, and for sorting words, it's
expedient to suppress that distinction. Part of that (probably) has to
do with the fact that spellings of personal names cannot be predicted by
sound, and sorting similar sounding names together is generally useful.
Pre-reform, the ß and ss were used in distinction, but in ways that were
not as clearly related to pronunciation of the word. Ergo, sorting words
had the same issues as sorting personal names. However, sorting and
searching are special in that they often create fairly wide equivalence
sets, compared to the distinctions needed in accurately writing text or
names.
The origin of the SS case mapping for ß is not actually known with
certainty. However, it was decreed at a time when the use of Fraktur
and typewriters was common. Typewriters had extreme limitations in the
number of signs they could support, and ALL UPPERCASE text in Fraktur is
an absurdity. Since the ß does not (ordinarily) occur in TitleCase,
which is very common in German (nouns), the impact of the standard
orthographic rule is limited.
Nevertheless, the post office (on forms), sign writers, certain name
registries, and many other users that use ALL UPPERCASE text (in modern
style, not Fraktur), feel that suppressing the distinction between words
and names that contain ß and those that contain 'ss' is not appropriate.
There are three ways this distinction can be maintained in ALL UPPERCASE
text: using SZ, retaining lowercase ß as-is, and using an uppercase
form of ß. All three forms can be found. And all three ways have their
adherents. Yes, that means that Germany is not united after all. ;-).
For the following argument, it is important not to conflate any of
these three forms with the standard orthography, which does equate ß
with "SS" in ALL UPPERCASE text. The standard orthography is the *only*
one that (outside sorting and searching) allows an equivalence between
SS as uppercase of ß and SS as uppercase of "ss", (while simultaneously
distinguishing carefully between their lower case forms).
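As a concrete illustration of that equivalence, here is a small Python
sketch. It relies only on the default case mappings built into Python 3
strings; the word pair Maße/Masse is chosen here purely for illustration.

```python
# Two distinct lowercase words: "maße" (measures) and "masse" (mass).
with_eszett = "maße"
with_double_s = "masse"

# The default (standard-orthography) uppercase mapping sends ß to SS,
# so the two words collapse into the same string in ALL UPPERCASE.
upper_eszett = with_eszett.upper()
upper_double_s = with_double_s.upper()

print(with_eszett == with_double_s)    # distinct in lower case
print(upper_eszett == upper_double_s)  # identical in upper case
```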
If you desire to carry the distinction between ss and ß in lower as well
as in uppercase, for semantic reasons, then choosing an encoding that is
based on a glyphic variation of SS may give you the presentation that
you desire but hides the distinction at the character level.
[Yes, it's possible to arrange layout engines as well as all text
processing to magically do the right thing, no matter how a text-element
is encoded, and no matter what the cost, but, putting it briefly, the
Unicode philosophy is to model things close to the common understanding
of the text element - unless the script model consistently supports a
non-intuitive approach. I see no recent precedent, incidentally, that,
by itself, would make this decision a slam-dunk, but I tend to dislike
piling complex-script like approaches onto Latin.]
If you desire to carry the distinction between SS and ß in ALL
UPPERCASE TEXT, for semantic reasons, there are currently these three ways:
* Using SZ. This is unattractive because converting the string to lower
case results in nonsense, and few if any text processes consider any
equivalence between 'sz' and ß. It feels unnatural to many readers.
Nevertheless it is used in certain cases.
* Using ß as is. This does not suffer from the aforementioned problem,
but is visually not appealing. Nevertheless, of the three, it is
currently the most widespread solution.
* Using an uppercase form of ß. This is currently only possible with
ad-hoc support. Nevertheless its use can be documented, and given the
technical challenges, is surprisingly frequent.
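How the three ways behave under lowercasing can be sketched in Python.
This assumes a Python 3 runtime whose Unicode data includes U+1E9E LATIN
CAPITAL LETTER SHARP S, the code point at which an uppercase ß was
eventually encoded, with ß as its lowercase mapping. The sample word is an
illustrative choice.

```python
CAPITAL_SHARP_S = "\u1e9e"  # assumption: the runtime's Unicode data has U+1E9E

# Three ALL UPPERCASE spellings of "Maße" that keep it distinct from "MASSE".
all_caps = {
    "SZ": "MASZE",
    "ß as-is": "MAßE",
    "uppercase ß": "MA" + CAPITAL_SHARP_S + "E",
}

# Default lowercasing: only the SZ form fails to recover the original word.
lowered = {way: text.lower() for way, text in all_caps.items()}
for way, text in lowered.items():
    print(way, "->", text)
```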
The proposal (as such) does not change the current orthography. The
proposal (as such) does not even try to standardize on the third form,
but merely proposes that the uppercase form of ß be considered a
character, and implemented as such. [Individuals among the proposers or
elsewhere may have an interest in promoting a change in writing
practices, but it is not Unicode's role to take sides on such larger
issues, and there's little objective reason to fear radical and imminent
change in the majority usage. Raising the threat of such change as if it
were imminent and inevitable would seem to border on fear-mongering, so
let's agree that it is neither].
Given that the use of an uppercase form of ß is clearly a variation of a
(currently more common) practice of using the lowercase form for the
same purpose, a search for a solution should start from the ß and not
from the equivalence to the SS. Because, while that equivalence is
present in the standard orthography, it is explicitly *rejected* by
users of all three alternative ways. Starting from the ß would follow
the principle of least surprise to the users and implementers.
Given that ALL UPPERCASE contexts are relatively uncommon, that
retaining the distinction between ß and SS is less common than giving
up that distinction as per dictate of the standard orthography, and that
out of three possible ways, only one uses an uppercase form of ß, the
expectation of the *average* German user would first and foremost be
that existing texts and implementation behave as before.
Adding a new character would therefore not change the default case
mapping of ß to SS. Users of the third way would need to enter their new
character by hand, or use special purpose software. The former is
appropriate for signage, book covers, and similar uses. The latter is
what the post office might use in a data processing center entering
hand-filled forms using ß. Institutions maintaining lists of names in
ALL UPPERCASE might utilize similar special purpose software.
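A minimal sketch of what such special purpose software might do, next to
the unchanged default behavior. The helper name upper_keeping_sharp_s is
hypothetical, and the sketch assumes a Python 3 runtime that knows U+1E9E
as the uppercase form of ß.

```python
CAPITAL_SHARP_S = "\u1e9e"  # assumption: U+1E9E is the encoded uppercase ß

def upper_keeping_sharp_s(text):
    """Hypothetical tailored uppercasing: ß becomes uppercase ß, not SS."""
    # Substitute first, so the default ß -> SS mapping never applies.
    return text.replace("ß", CAPITAL_SHARP_S).upper()

name = "Straße"
default_upper = name.upper()                   # standard orthography
tailored_upper = upper_keeping_sharp_s(name)   # third way, by opt-in only

print(default_upper)
print(tailored_upper)
```

The default mapping is untouched; only callers that opt into the tailored
function see the new character.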
For users of the third way, what would change as a result of adding a
character is that current ad-hoc solutions could be replaced by
*conformant* solutions with initially equal functionality. To the degree
that certain very common font suites were to add a glyph for this
character, reasonable transmission on the web and in e-mail would work
in the medium term. If the default lowercase mapping of the character is
to the existing ß, name and form data can be converted to standard
orthography by title casing (nouns/names) or lowercasing, which would be
useful (and retain the desired distinction).
Extending the weak equivalence to SS for sorting and searching (by
default) would make data using the new character equally accessible.
Obviously, however, the whole reason for using the ß is so that some
search modes would *not* make that equivalence. Such search modes are
already required to support users of the second way, which is currently
the most common way of supporting the distinction between ß and SS in
ALL UPPERCASE context.
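The two kinds of search mode can be approximated in Python. This is only a
sketch: full case folding (str.casefold) stands in for the weak equivalence
a real collation or search engine would provide, plain lowercasing stands
in for a mode that keeps the distinction, and the function names are
hypothetical.

```python
CAPITAL_SHARP_S = "\u1e9e"  # assumption: the runtime's Unicode data has U+1E9E

def loose_key(text):
    """Weak equivalence: ß, uppercase ß, ss and SS all compare equal."""
    return text.casefold()

def strict_key(text):
    """Case-insensitive, but ß and ss remain distinct."""
    return text.lower()

records = ["MASSE", "MAßE", "MA" + CAPITAL_SHARP_S + "E"]
query = "Maße"

loose_hits = [r for r in records if loose_key(r) == loose_key(query)]
strict_hits = [r for r in records if strict_key(r) == strict_key(query)]

print(loose_hits)   # all three records
print(strict_hits)  # only the two records that carry a ß
```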
The existence of this 'second way' (retaining lower case ß 'as-is') and
the fact that it is, for now, the most common non-standard way of
retaining the distinction between ß and SS in ALL UPPERCASE context,
means that the third way cannot be considered in isolation. For example,
a lot less could be gained by basing the third way on an encoding that is
based on SS, because that makes it different from the second way. On the
contrary, many of the potential complications of, as well as solutions
for, addressing the third way with a new character are already present
because of the second way.
The primary exception on the text processing level would be the lack of
a (default) uppercase mapping from ß to the new character. I concur with
the proposers' judgment that this is not an issue for the *average*
user, and that the adherents of the third way either can live with that
restriction or that they will (be able to) use tailored software. [It's
possible to disagree with that judgment, but that comes down to a matter
of opinion].
The primary exception on the display level would be the lack (for a
transition time) of a glyph in many or most fonts.
It is sometimes claimed that <S, ZWJ, S> would gracefully fall back to
"SS" and that would make it more attractive than the 'missing glyph'
that would ensue if there was a new character, but no glyph in the font.
While the fallback does work wherever the system enforces the
default-ignorable property of ZWJ, it violates the rule of 'no
surprises' since anyone who intends to communicate a distinction between
ß and SS will no longer be able to predict what the other side will see,
and there will be no obvious indication of error. [Users of the third
way that anticipate transmission problems would presumably rather fall
back, manually, to the second way.]
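The silent nature of that fallback can be sketched as follows. The
function fallback_display is a hypothetical stand-in for a renderer
without ligature support that honors the default-ignorable property of
ZWJ; it is not a model of any particular layout engine.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER, a default-ignorable code point

def fallback_display(text):
    """Hypothetical renderer without the ligature: ZWJ is simply not shown."""
    return text.replace(ZWJ, "")

sender_text = "MA" + "S" + ZWJ + "S" + "E"  # sender means uppercase "Maße"
plain_text = "MASSE"                        # uppercase "Masse"

seen_by_receiver = fallback_display(sender_text)

# The encoded strings differ, yet the receiver sees the same thing for
# both, with nothing on screen to signal that a distinction was lost.
print(sender_text == plain_text)
print(seen_by_receiver == fallback_display(plain_text))
```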
Incidentally, it is equally unclear whether such a ligature could/would
be enabled without affecting the use of all other ligatures in the
document. Ligatures across compound-word internal boundaries are not
desirable in German, and might have to be suppressed individually with
ZWNJ before ligatures could be enabled globally for German text.
Positive ligature support may be absent or may not be controllable in
forms. Such complications can easily mean that using an SS ligature is
as limiting in practice as using a new character with initially
limited font support.
Lowercasing such data opens a new issue, i.e. that of displaying <s,
ZWJ, s>. If fonts were to utilize a ß glyph for that sequence, which
might be only too tempting, then it could encourage a dual representation of
the lower case ß. If they were not, then lowercasing a text that intends
to make a distinction that is unequivocally correct and required in
lower case text, would result in its being removed--unless a special
mapping <S, ZWJ, S> ---> ß were to be widely implemented. [Not to
mention that such a mapping would go against the principle of not having
ZWJ affect casing].
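A short Python sketch of the lowercasing problem. Default lowercasing
leaves ZWJ alone and lowercases each S separately; the tailored function
lower_with_special_mapping is hypothetical and only illustrates what a
special mapping would have to do.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

upper_sequence = "MA" + "S" + ZWJ + "S" + "E"  # intended as uppercase "Maße"

# Default lowercasing yields <s, ZWJ, s>, not ß.
default_lower = upper_sequence.lower()

def lower_with_special_mapping(text):
    """Hypothetical tailoring: map <S, ZWJ, S> to ß before lowercasing."""
    return text.replace("S" + ZWJ + "S", "ß").lower()

tailored_lower = lower_with_special_mapping(upper_sequence)

print(default_lower == "maße")
print(tailored_lower == "maße")
```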
While the facts about actual usage can be established and putative
consequences for both proposed solution and counterproposal can be
mapped, the weighting of this information is and remains a matter of
judgment, and true precedents for such a complicated situation are lacking.
Finally, what of the non-technical factors that UTC should consider when
making encoding decisions?
There seems to be agreement that Unicode does not restrict itself to
standard orthography, that it is descriptive rather than prescriptive,
and that it takes no sides in settling orthographies - but retains the
right to determine how best to reflect a given orthography in an
encoding. All three ways discussed here would qualify for being
encodable, based on their degree of documented usage [two of course are
already encodable].
There's considerably less agreement on how to account for historical
development, including the origin (putative or documented) of a form,
trends in the development of an orthography (observable or speculative)
and predictions of future (or far future) outcomes. In the case at hand,
I tend to believe in the existence of overarching trends, while
simultaneously disbelieving a concrete possibility of real and
widespread change in actual practices on the ground in the near to
medium term.
In terms of stability of properties, it is claimed that proponents of
the third way would ask (eventually) for a change of the mapping from ß
to SS to a mapping from ß to uppercase ß. Well, they might, but my firm
assumption is that UTC will do the research to base its decisions on the
needs of the *average* user. As long as the standard orthography remains
the standard, those needs are unchanged. Not encoding a new character,
by the way, is no safeguard, because proponents of the second way (and
there are more of them) could ask for a similar incompatible change in
mapping (to always leave the ß as-is.)
Under the assumption that UTC continues to be able to do due diligence
in this case, neither scenario represents a true risk - up until that
potential far-in-the-future time that the *average* user wants a
different behavior, at which time the UTC has worse problems than
whether the uppercase ß should be a character or <S, ZWJ, S>. [In fact,
in precisely such a case, that elegant fall-back would likely be a true
liability].
For these reasons I continue to support, on balance, the proposal as
submitted and continue to discount many of the scare scenarios. Even
with the addition of a new character, none of the three ways discussed
here are ideal, and neither is the standard orthography as it stands.
However, the existence of these multiple ways is itself a mirror of the
(near glacial) change in interpretation and usage of the ß. This is a
historical process, and if Unicode has a role, it is to remain neutral,
but supportive.
A./