Re: Uppercase is coming? (U+1E9E)

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Tue May 08 2007 - 00:51:46 CDT

    These long messages with interspersed quotes make for colorful reading,
    but obscure the flow of the argument. I'll try to lay it out one more
    time, in order, but with an eye towards answering some of the recent points:

    In German orthography (especially after the recent reform), there is a
    clear distinction between ß and "ss" in lower case. There are some word
    pairs where it's the only distinction. (The same is true for some
    personal names).

    For purposes of searching personal names, and for sorting words, it's
    expedient to suppress that distinction. Part of that (probably) has to
    do with the fact that spellings of personal names cannot be predicted by
    sound, and sorting similar sounding names together is generally useful.
    Pre-reform, the ß and ss were used in distinction, but in ways that were
    not as clearly related to pronunciation of the word. Ergo, sorting words
    had the same issues as sorting personal names. However, sorting and
    searching are special in that they often create fairly wide equivalence
    sets, compared to the distinctions needed in accurately writing text or
    names.

    The origin of the SS case mapping for ß is not actually known with
    certainty. However, it was decreed at a time when the use of Fraktur
    and typewriters was common. Typewriters had extreme limitations in the
    number of signs they could support, and ALL UPPERCASE text in Fraktur is
    an absurdity. Since the ß does not (ordinarily) occur in TitleCase,
    which is very common in German (nouns), the impact of the standard
    orthographic rule is limited.

    Nevertheless, the post office (on forms), sign writers, certain name
    registries, and many other users that use ALL UPPERCASE text (in modern
    style, not Fraktur), feel that suppressing the distinction between words
    and names that contain ß and those that contain 'ss' is not appropriate.

    There are three ways this distinction can be maintained in ALL UPPERCASE
    text: use of SZ, retaining lower case ß as-is, and using an uppercase
    form of ß. All three forms can be found. And all three ways have their
    adherents. Yes, that means that Germany is not united after all. ;-).

    For the following argument, it is important not to conflate any of
    these three forms with the standard orthography, which does equate ß
    with "SS" in ALL UPPERCASE text. The standard orthography is the *only*
    one that (outside sorting and searching) allows an equivalence between
    SS as uppercase of ß and SS as uppercase of "ss" (while simultaneously
    distinguishing carefully between their lower case forms).

    If you desire to carry the distinction between ss and ß in lower as well
    as in uppercase, for semantic reasons, then choosing an encoding that is
    based on a glyphic variation of SS may give you the presentation that
    you desire but hides the distinction at the character level.

    [Yes, it's possible to arrange layout engines as well as all text
    processing to magically do the right thing, no matter how a text-element
    is encoded, and no matter what the cost, but, putting it briefly, the
    Unicode philosophy is to model things close to the common understanding
    of the text element - unless the script model consistently supports a
    non-intuitive approach. I see no recent precedent, incidentally, that,
    by itself, would make this decision a slam-dunk, but I tend to dislike
    piling complex-script like approaches onto Latin.]

    If you desire to carry the distinction between SS and ß in ALL
    UPPERCASE TEXT, for semantic reasons, there are currently these three ways:

    * Using SZ. This is unattractive because converting the string to lower
    case results in nonsense, and few if any text processes consider any
    equivalence between 'sz' and ß. It feels unnatural to many readers.
    Nevertheless it is used in certain cases.

    * Using ß as is. This does not suffer from the aforementioned problem,
    but is visually not appealing. Nevertheless, of the three, it is
    currently the most widespread solution.

    * Using an uppercase form of ß. This is currently only possible with
    ad-hoc support. Nevertheless its use can be documented, and given the
    technical challenges, is surprisingly frequent.

    The proposal (as such) does not change the current orthography. The
    proposal (as such) does not even try to standardize on the third form,
    but merely proposes that the uppercase form of ß be considered a
    character, and implemented as such. [Individuals among the proposers or
    elsewhere may have an interest in promoting a change in writing
    practices, but it is not Unicode's role to take sides on such larger
    issues, and there's little objective reason to fear radical and imminent
    change in the majority usage. Raising the threat of such change as if it
    were imminent and inevitable would seem to border on fear-mongering, so
    let's agree that it is neither].

    Given that the use of an uppercase form of ß is clearly a variation of a
    (currently more common) practice of using the lowercase form for the
    same purpose, a search for a solution should start from the ß and not
    from the equivalence to the SS, because, while that equivalence is
    present in the standard orthography, it is explicitly *rejected* by
    users of all three alternative ways. Starting from the ß would follow
    the principle of least surprise to the users and implementers.

    Given that ALL UPPERCASE contexts are relatively uncommon, that
    retaining the distinction between ß and SS is less common than giving
    up that distinction as per dictate of the standard orthography, and that
    out of three possible ways, only one uses an uppercase form of ß, the
    expectation of the *average* German user would first and foremost be
    that existing texts and implementations behave as before.

    Adding a new character would therefore not change the default case
    mapping of ß to SS. Users of the third way would need to enter their new
    character by hand, or use special purpose software. The former is
    appropriate for signage, book covers, and similar uses. The latter is
    what the post office might use in a data processing center entering
    hand-filled forms using ß. Institutions maintaining lists of names in
    ALL UPPERCASE might utilize similar special purpose software.
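
    As a rough illustration of what such special purpose software might
    do, here is a minimal Python sketch. The function names are
    hypothetical, the word pair Maße/Masse is only an example, and the
    code point U+1E9E for the new character is the one named in the
    subject line. The default mapping is left untouched; only the
    tailored function produces the uppercase form.

```python
# Code point from the subject line; everything else here is illustrative.
CAPITAL_SHARP_S = "\u1E9E"
SMALL_SHARP_S = "\u00DF"


def upper_default(text: str) -> str:
    """Default uppercasing: the standard orthography maps ß to SS."""
    return text.upper()


def upper_third_way(text: str) -> str:
    """Hypothetical tailored uppercasing that keeps ß distinct from ss.

    ß is swapped for the uppercase form first, so the default
    ß -> SS mapping never gets a chance to apply.
    """
    return text.replace(SMALL_SHARP_S, CAPITAL_SHARP_S).upper()


if __name__ == "__main__":
    for word in ("Maße", "Masse"):
        print(word, upper_default(word), upper_third_way(word))
```

    With the default mapping both words collapse to MASSE; with the
    tailored function they stay apart.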

    For users of the third way, what would change as a result of adding a
    character is that current ad-hoc solutions could be replaced by
    *conformant* solutions with initially equal functionality. To the degree
    that certain very common font suites were to add a glyph for this
    character, reasonable transmission on the web and in e-mail would work
    in the medium term. If the default lowercase mapping of the character is
    to the existing ß, name and form data can be converted to standard
    orthography by title casing (nouns/names) or lowercasing, which would be
    useful (and retain the desired distinction).
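
    A small Python sketch of that conversion, assuming a runtime whose
    Unicode data gives U+1E9E a lowercase mapping to ß (the helper names
    and the sample name are made up for illustration):

```python
def to_lower(form_data: str) -> str:
    """Lowercase ALL UPPERCASE data; the new character maps back to ß."""
    return form_data.lower()


def to_title(form_data: str) -> str:
    """Title-case ALL UPPERCASE data (nouns, names)."""
    return form_data.title()


if __name__ == "__main__":
    with_new_char = "GRO\u1E9EMANN"   # third way: uppercase form of ß
    with_ss = "GROSSMANN"             # name spelled with ss
    print(to_title(with_new_char))    # distinction survives as ß
    print(to_title(with_ss))          # stays ss
```

    The two spellings remain distinct after conversion, which is the
    point of mapping the new character to the existing ß.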

    Extending the weak equivalence to SS for sorting and searching (by
    default) would make data using the new character equally accessible.
    Obviously, however, the whole reason for using the ß is so that some
    search modes would *not* make that equivalence. Such search modes are
    already required to support users of the second way, which is currently
    the most common way of supporting the distinction between ß and SS in
    ALL UPPERCASE context.
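
    The two kinds of search mode can be sketched with two comparison
    keys. This is only an approximation in Python, using the built-in
    casefold() for the loose mode and lower() for the strict one; the
    key function names are hypothetical, and a real system would use
    proper collation rather than these shortcuts.

```python
def loose_key(text: str) -> str:
    """Loose mode: full case folding, which treats ß like ss."""
    return text.casefold()


def strict_key(text: str) -> str:
    """Strict mode: plain lowercasing, which keeps ß apart from ss."""
    return text.lower()


if __name__ == "__main__":
    second_way = "MA\u00DFE"   # lower case ß retained in uppercase text
    third_way = "MA\u1E9EE"    # uppercase form of ß
    standard = "MASSE"
    for s in (second_way, third_way, standard):
        print(s, loose_key(s), strict_key(s))
```

    Under the loose key all three spellings fall together; under the
    strict key the second and third way agree with each other and
    differ from SS.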

    The existence of this 'second way' (retaining lower case ß 'as-is') and
    the fact that it is, for now, the most common non-standard way of
    retaining the distinction between ß and SS in ALL UPPERCASE context,
    means that the third way cannot be considered in isolation. For example,
    a lot less could be gained by basing the third way on an encoding that is
    based on SS, because that makes it different from the second way. On the
    contrary, many of the potential complications of, as well as solutions
    for, addressing the third way with a new character are already present
    because of the second way.

    The primary exception on the text processing level would be the lack of
    a (default) uppercase mapping from ß to the new character. I concur with
    the proposers' judgment that this is not an issue for the *average*
    user, and that the adherents of the third way either can live with that
    restriction or that they will (be able to) use tailored software. [It's
    possible to disagree with that judgment, but that comes down to a matter
    of opinion].

    The primary exception on the display level would be the lack (for a
    transition time) of a glyph in many or most fonts.

    It is sometimes claimed that <S, ZWJ, S> would gracefully fall back to
    "SS" and that would make it more attractive than the 'missing glyph'
    that would ensue if there was a new character, but no glyph in the font.
    While the fallback does work wherever the system enforces the
    default-ignorable property of ZWJ, it violates the rule of 'no
    surprises' since anyone who intends to communicate a distinction between
    ß and SS will no longer be able to predict what the other side will see,
    and there will be no obvious indication of error. [Users of the third
    way that anticipate transmission problems would presumably rather fall
    back, manually, to the second way.]
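
    To make the 'no surprises' point concrete, here is a tiny Python
    simulation. It assumes a receiving system that simply drops ZWJ
    before display; the function name and the sample word are
    illustrative only.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER, a default-ignorable code point


def display_ignoring_zwj(text: str) -> str:
    """Simulate a system that enforces default-ignorable by dropping ZWJ."""
    return text.replace(ZWJ, "")


plain = "MASSE"                               # standard orthography
ligature_encoded = "MA" + "S" + ZWJ + "S" + "E"  # <S, ZWJ, S> approach
new_character = "MA\u1E9EE"                   # new-character approach

if __name__ == "__main__":
    # The sender encoded a distinction, but the receiver sees none.
    print(display_ignoring_zwj(ligature_encoded) == plain)
    # With a new character the text never silently turns into SS.
    print(display_ignoring_zwj(new_character) == plain)
```

    The fallback is graceful in appearance, but the distinction is
    lost without any visible sign that something went wrong.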

    Incidentally, it is equally unclear whether such a ligature could/would
    be enabled without affecting the use of all other ligatures in the
    document. Ligatures across compound-word internal boundaries are not
    desirable in German, and might have to be suppressed individually with
    ZWNJ before ligatures could be enabled globally for German text.
    Positive ligature support may be absent or may not be controllable in
    forms. Such complications can easily mean that using an SS ligature is
    just as limiting in practice as using a new character with initially
    limited font support.

    Lowercasing such data opens a new issue, i.e. that of displaying <s,
    ZWJ, s>. If fonts were to utilize a ß glyph for that sequence, which
    might only be tempting, then it could encourage a dual representation of
    the lower case ß. If they were not, then lowercasing a text that intends
    to make a distinction that is unequivocally correct and required in
    lower case text, would result in its being removed--unless a special
    mapping <S, ZWJ, S> ---> ß were to be widely implemented. [Not to
    mention that such a mapping would go against the principle of not having
    ZWJ affect casing].
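
    The lowercasing problem can be seen in a short Python sketch, which
    assumes default (untailored) case mapping and a display that ignores
    ZWJ; the helper name and sample word are hypothetical.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER; default case mapping leaves it alone


def lower_as_displayed(text: str) -> str:
    """Lowercase with default mappings, then drop ZWJ as a display would."""
    return text.lower().replace(ZWJ, "")


if __name__ == "__main__":
    ligature_encoded = "MA" + "S" + ZWJ + "S" + "E"
    new_character = "MA\u1E9EE"
    # <S, ZWJ, S> lowercases to <s, ZWJ, s>, which reads as plain ss.
    print(lower_as_displayed(ligature_encoded))
    # The new character lowercases to ß, so the distinction survives.
    print(lower_as_displayed(new_character))
```

    Without a special mapping for the sequence, the required lower case
    distinction is gone after a round trip through lowercasing.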

    While the facts about actual usage can be established and putative
    consequences for both proposed solution and counterproposal can be
    mapped, the weighting of this information is and remains a matter of
    judgment, and true precedents for such a complicated situation are lacking.

    Finally, what of the non-technical factors that UTC should consider when
    making encoding decisions?

    There seems to be agreement that Unicode does not restrict itself to
    standard orthography, that it is descriptive rather than prescriptive,
    and that it takes no sides in settling orthographies - but retains the
    right to determine how best to reflect a given orthography in an
    encoding. All three ways discussed here would qualify for being
    encodable, based on their degree of documented usage [two of course are
    already encodable].

    There's considerably less agreement on how to account for historical
    development, including the origin (putative or documented) of a form,
    trends in the development of an orthography (observable or speculative)
    and predictions of future (or far future) outcomes. In the case at hand,
    I tend to believe in the existence of overarching trends, while
    simultaneously disbelieving a concrete possibility of real and
    widespread change in actual practices on the ground in the near to
    medium term.

    In terms of stability of properties, it is claimed that proponents of
    the third way would ask (eventually) for a change of the mapping from ß
    to SS to a mapping from ß to uppercase ß. Well, they might, but my firm
    assumption is that UTC will do the research to base its decisions on the
    needs of the *average* user. As long as the standard orthography remains
    the standard, those needs are unchanged. Not encoding a new character,
    by the way, is no safeguard, because proponents of the second way (and
    there are more of them) could ask for a similar incompatible change in
    mapping (to always leave the ß as-is).

    Under the assumption that UTC continues to be able to do due diligence
    in this case, neither scenario represents a true risk - up until that
    potential far-in-the-future time that the *average* user wants a
    different behavior, at which time the UTC has worse problems than
    whether the uppercase ß should be a character or <S, ZWJ, S>. [In fact,
    in precisely such a case, that elegant fall-back would likely be a true
    liability].

    For these reasons I continue to support, on balance, the proposal as
    submitted and continue to discount many of the scare scenarios. Even
    with the addition of a new character, none of the three ways discussed
    here are ideal, and neither is the standard orthography as it stands.
    However, the existence of these multiple ways is itself a mirror of the
    (near glacial) change in interpretation and usage of the ß. This is a
    historical process, and if Unicode has a role, it is to remain neutral,
    but supportive.

    A./


