From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Tue May 08 2007 - 00:51:46 CDT
These long messages with interspersed quotes make for colorful reading, 
but obscure the flow of the argument. I'll try to lay it out one more 
time, in order, but with an eye towards answering some of the recent points:
In German orthography (especially after the recent reform), there is a 
clear distinction between ß and "ss" in lower case. There are some word 
pairs where it's the only distinction. (The same is true for some 
personal names).
For purposes of searching personal names, and for sorting words, it's 
expedient to suppress that distinction. Part of that (probably) has to 
do with the fact that spellings of personal names cannot be predicted by 
sound, and sorting similar sounding names together is generally useful. 
Pre-reform, the ß and ss were used in distinction, but in ways that were 
not as clearly related to pronunciation of the word. Ergo, sorting words 
had the same issues as sorting personal names. However, sorting and 
searching are special in that they often create fairly wide equivalence 
sets, compared to the distinctions needed in accurately writing text or 
names.
The origin of the SS case mapping for ß is not actually known with 
certainty. However, it was decreed at a time when the use of Fraktur 
and typewriters was common. Typewriters had extreme limitations in the 
number of signs they could support, and ALL UPPERCASE text in Fraktur is 
an absurdity. Since the ß does not (ordinarily) occur in TitleCase, 
which is very common in German (nouns), the impact of the standard 
orthographic rule is limited.
Nevertheless, the post office (on forms), sign writers, certain name 
registries, and many other users that use ALL UPPERCASE text (in modern 
style, not Fraktur), feel that suppressing the distinction between words 
and names that contain ß and those that contain 'ss' is not appropriate.
There are three ways this distinction can be maintained in ALL UPPERCASE 
text: using SZ, retaining lower case ß as-is, and using an uppercase 
form of ß. All three forms can be found, and all three ways have their 
adherents. Yes, that means that Germany is not united after all. ;-).
For the following argument, it is important not to conflate any of 
these three forms with the standard orthography, which does equate ß 
with "SS" in ALL UPPERCASE text. The standard orthography is the *only* 
one that (outside sorting and searching) allows an equivalence between 
SS as uppercase of ß and SS as uppercase of "ss", (while simultaneously 
distinguishing carefully between their lower case forms).
If you desire to carry the distinction between ss and ß in lower as well 
as in uppercase, for semantic reasons, then choosing an encoding that is 
based on a glyphic variation of SS may give you the presentation that 
you desire but hides the distinction at the character level.
[Yes, it's possible to arrange layout engines as well as all text 
processing to magically do the right thing, no matter how a text-element 
is encoded, and no matter what the cost, but, putting it briefly, the 
Unicode philosophy is to model things close to the common understanding 
of the text element - unless the script model consistently supports a 
non-intuitive approach. I see no recent precedent, incidentally, that, 
by itself, would make this decision a slam-dunk, but I tend to dislike 
piling complex-script like approaches onto Latin.]
If you desire to carry the distinction between SS and ß in ALL 
UPPERCASE TEXT, for semantic reasons, there are currently these three ways:
* Using SZ. This is unattractive because converting the string to lower 
case results in nonsense, and few if any text processes consider any 
equivalence between 'sz' and ß. It feels unnatural to many readers. 
Nevertheless it is used in certain cases.
* Using ß as is. This does not suffer from the aforementioned problem, 
but is visually not appealing. Nevertheless, of the three, it is 
currently the most widespread solution.
* Using an uppercase form of ß. This is currently only possible with 
ad-hoc support. Nevertheless its use can be documented, and given the 
technical challenges, is surprisingly frequent.
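A minimal sketch of how the standard form and the three ways behave under default lowercasing, assuming Python 3 and its built-in `str.lower()`. The example word ("Maße" vs. "Masse") and the use of U+1E9E as the code point for an uppercase ß are assumptions made for illustration; neither is named in the text above.

```python
# Hypothetical example word: "Maße" (measurements) vs. "Masse" (mass).
# U+1E9E is assumed here as the code point of an uppercase ß.
CAPITAL_SHARP_S = "\u1e9e"

uppercase_forms = {
    "standard (SS)": "MASSE",
    "first way (SZ)": "MASZE",
    "second way (ß as-is)": "MAßE",
    "third way (uppercase ß)": "MA" + CAPITAL_SHARP_S + "E",
}

# Lowercase each form with the default mapping and see what comes back.
lowered = {label: form.lower() for label, form in uppercase_forms.items()}

for label, result in lowered.items():
    print(f"{label:26} -> {result}")
```

Under these assumptions only the second and third ways round-trip to "maße"; the SZ form yields "masze", and the standard form collapses into "masse".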
The proposal (as such) does not change the current orthography. The 
proposal (as such) does not even try to standardize on the third form, 
but merely proposes that the uppercase form of ß be considered a 
character, and implemented as such. [Individuals among the proposers or 
elsewhere may have an interest in promoting a change in writing 
practices, but it is not Unicode's role to take sides on such larger 
issues, and there's little objective reason to fear radical and imminent 
change in the majority usage. Raising the threat of such change as if it 
were imminent and inevitable would seem to border on fear-mongering, so 
let's agree that it is neither].
Given that the use of an uppercase form of ß is clearly a variation of a 
(currently more common) practice of using the lowercase form for the 
same purpose, a search for a solution should start from the ß and not 
from the equivalence to the SS. Because, while that equivalence is 
present in the standard orthography, it is explicitly *rejected* by 
users of all three alternative ways. Starting from the ß would follow 
the principle of least surprise to the users and implementers.
Given that ALL UPPERCASE contexts are relatively uncommon, that 
retaining the distinction between ß and SS is less common than giving 
up that distinction as per dictate of the standard orthography, and that 
out of three possible ways, only one uses an uppercase form of ß, the 
expectation of the *average* German user would first and foremost be 
that existing texts and implementations behave as before.
Adding a new character would therefore not change the default case 
mapping of ß to SS. Users of the third way would need to enter their new 
character by hand, or use special purpose software. The former is 
appropriate for signage, book covers, and similar uses. The latter is 
what the post office might use in a data processing center entering 
hand-filled forms using ß. Institutions maintaining lists of names in 
ALL UPPERCASE might utilize similar special purpose software.
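As a rough illustration of the split between default behavior and special purpose software, here is a sketch assuming Python 3. `upper_third_way` is a hypothetical tailoring, not an existing API, and U+1E9E is assumed as the code point of the new character. The default `str.upper()` is left untouched.

```python
CAPITAL_SHARP_S = "\u1e9e"  # assumed code point for the uppercase ß


def upper_default(text: str) -> str:
    """Default uppercasing: ß becomes SS, as in the standard orthography."""
    return text.upper()


def upper_third_way(text: str) -> str:
    """Hypothetical tailoring: map ß to the uppercase ß before uppercasing."""
    return text.replace("ß", CAPITAL_SHARP_S).upper()


# Two distinct personal names that differ only in ß vs. ss.
print(upper_default("Weiß"), upper_default("Weiss"))      # both collapse to one form
print(upper_third_way("Weiß"), upper_third_way("Weiss"))  # distinction is kept
```

The point of the sketch is that only callers who opt into the tailored function see different output; everything that relies on the default mapping behaves as before.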
For users of the third way, what would change as a result of adding a 
character is that current ad-hoc solutions could be replaced by 
*conformant* solutions with initially equal functionality. To the degree 
that certain very common font suites were to add a glyph for this 
character, reasonable transmission on the web and in e-mail would work 
in the medium term. If the default lowercase mapping of the character is 
to the existing ß, name and form data can be converted to standard 
orthography by title casing (nouns/names) or lowercasing, which would be 
useful (and retain the desired distinction).
Extending the weak equivalence to SS for sorting and searching (by 
default) would make data using the new character equally accessible. 
Obviously, however, the whole reason for using the ß is so that some 
search modes would *not* make that equivalence. Such search modes are 
already required to support users of the second way, which is currently 
the most common way of supporting the distinction between ß and SS in 
ALL UPPERCASE context.
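The two search modes can be sketched as two key functions, assuming Python 3. `loose_key` and `strict_key` are hypothetical names; `str.casefold()` stands in for the weak equivalence and `str.lower()` for a mode that keeps ß apart from ss. U+1E9E as the uppercase ß is again an assumption.

```python
def loose_key(text: str) -> str:
    """Weak equivalence: full case folding treats ß and the uppercase ß like ss."""
    return text.casefold()


def strict_key(text: str) -> str:
    """Distinguishing mode: plain lowercasing keeps ß separate from ss."""
    return text.lower()


# Standard form, second way, and third way of writing a name in ALL UPPERCASE.
records = ["WEISS", "WEIß", "WEI\u1e9e"]

for record in records:
    print(record, loose_key(record), strict_key(record))
```

In this sketch the strict mode already needed for the second way also covers the third way, since both reduce to the same key while remaining distinct from the SS form.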
The existence of this 'second way' (retaining lower case ß 'as-is') and 
the fact that it is, for now, the most common non-standard way of 
retaining the distinction between ß and SS in ALL UPPERCASE context, 
means that the third way cannot be considered in isolation. For example, 
a lot less could be gained by basing the third way on an encoding that is 
based on SS, because that makes it different from the second way. On the 
contrary, many of the potential complications of, as well as solutions 
for, addressing the third way with a new character are already present 
because of the second way.
The primary exception on the text processing level would be the lack of 
a (default) uppercase mapping from ß to the new character. I concur with 
the proposers' judgment that this is not an issue for the *average* 
user, and that the adherents of the third way either can live with that 
restriction or that they will (be able to) use tailored software. [It's 
possible to disagree with that judgment, but that comes down to a matter 
of opinion].
The primary exception on the display level would be the lack (for a 
transition time) of a glyph in many or most fonts.
It is sometimes claimed that <S, ZWJ, S> would gracefully fall back to 
"SS" and that would make it more attractive than the 'missing glyph' 
that would ensue if there was a new character, but no glyph in the font. 
While the fallback does work wherever the system enforces the 
default-ignorable property of ZWJ, it violates the rule of 'no 
surprises' since anyone who intends to communicate a distinction between 
ß and SS will no longer be able to predict what the other side will see, 
and there will be no obvious indication of error. [Users of the third 
way that anticipate transmission problems would presumably rather fall 
back, manually, to the second way.]
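The silent fallback can be illustrated with a small sketch, assuming Python 3. `render_without_ligature_support` is a hypothetical stand-in for a display system that simply ignores ZWJ (U+200D); it does not model any particular layout engine.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER, a format character (category Cf)

# A name written with the <S, ZWJ, S> sequence to signal "this is really ß".
ligature_encoding = "WEI" + "S" + ZWJ + "S"


def render_without_ligature_support(text: str) -> str:
    """Hypothetical renderer that drops ZWJ, as default-ignorable handling would."""
    return text.replace(ZWJ, "")


seen_by_recipient = render_without_ligature_support(ligature_encoding)
print(unicodedata.category(ZWJ), seen_by_recipient)
```

What the recipient sees is indistinguishable from a plain "WEISS", and nothing in the output hints that a distinction was intended.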
Incidentally, it is equally unclear whether such a ligature could/would 
be enabled without affecting the use of all other ligatures in the 
document. Ligatures across compound-word internal boundaries are not 
desirable in German, and might have to be suppressed individually with 
ZWNJ before ligatures could be enabled globally for German text. 
Positive ligature support may be absent or may not be controllable in 
forms. Such complications can easily mean that using an SS ligature is 
as limiting in practice as using a new character with initially 
limited font support.
Lowercasing such data opens a new issue, i.e. that of displaying <s, 
ZWJ, s>. If fonts were to utilize a ß glyph for that sequence, which 
might only be tempting, then it could encourage a dual representation of 
the lower case ß. If they were not, then lowercasing a text that intends 
to make a distinction that is unequivocally correct and required in 
lower case text, would result in its being removed--unless a special 
mapping <S, ZWJ, S> ---> ß were to be widely implemented. [Not to 
mention that such a mapping would go against the principle of not having 
ZWJ affect casing].
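A short sketch of the lowercasing issue, assuming Python 3 and its default `str.lower()`. `lower_with_special_mapping` is a hypothetical function showing what a special <S, ZWJ, S> to ß mapping would have to do; it is not an existing or proposed API.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

upper_text = "WEI" + "S" + ZWJ + "S"

# Default lowercasing leaves the ZWJ in place and never produces ß.
default_lower = upper_text.lower()


def lower_with_special_mapping(text: str) -> str:
    """Hypothetical special mapping: treat <S, ZWJ, S> as ß when lowercasing."""
    return text.replace("S" + ZWJ + "S", "ß").lower()


print(repr(default_lower))
print(repr(lower_with_special_mapping(upper_text)))
```

Without the special mapping, the lowercased text carries <s, ZWJ, s> rather than ß; with it, ZWJ ends up influencing the casing result.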
While the facts about actual usage can be established and putative 
consequences for both proposed solution and counterproposal can be 
mapped, the weighting of this information is and remains a matter of 
judgment, and true precedents for such a complicated situation are lacking.
Finally, what of the non-technical factors that UTC should consider when 
making encoding decisions?
There seems to be agreement that Unicode does not restrict itself to 
standard orthography, that it is descriptive rather than prescriptive, 
and that it takes no sides in settling orthographies - but retains the 
right to determine how best to reflect a given orthography in an 
encoding. All three ways discussed here would qualify for being 
encodable, based on their degree of documented usage [two of course are 
already encodable].
There's considerably less agreement on how to account for historical 
development, including the origin (putative or documented) of a form, 
trends in the development of an orthography (observable or speculative) 
and predictions of future (or far future) outcomes. In the case at hand, 
I tend to believe in the existence of overarching trends, while 
simultaneously disbelieving a concrete possibility of real and 
widespread change in actual practices on the ground in the near to 
medium term.
In terms of stability of properties, it is claimed that proponents of 
the third way would ask (eventually) for a change of the mapping from ß 
to SS to a mapping from ß to uppercase ß. Well, they might, but my firm 
assumption is that UTC will do the research to base its decisions on the 
needs of the *average* user. As long as the standard orthography remains 
the standard, those needs are unchanged. Not encoding a new character, 
by the way, is no safeguard, because proponents of the second way (and 
there are more of them) could ask for a similar incompatible change in 
mapping (to always leave the ß as-is.)
Under the assumption that UTC continues to be able to do due diligence 
in this case, neither scenario represents a true risk - up until that 
potential far-in-the-future time that the *average* user wants a 
different behavior, at which time the UTC has worse problems than 
whether the uppercase ß should be a character or <S, ZWJ, S>. [In fact, 
in precisely such a case, that elegant fall-back would likely be a true 
liability].
For these reasons I continue to support, on balance, the proposal as 
submitted and continue to discount many of the scare scenarios. Even 
with the addition of a new character, none of the three ways discussed 
here is ideal, and neither is the standard orthography as it stands. 
However, the existence of these multiple ways is itself a mirror of the 
(near glacial) change in interpretation and usage of the ß. This is a 
historical process, and if Unicode has a role, it is to remain neutral, 
but supportive.
A./