Latin encoding model (was: Re: slashed letters)

From: Karl Pentzlin (karl-pentzlin@acssoft.de)
Date: Mon Oct 27 2008 - 00:10:54 CST

    On Monday, 27 October 2008 at 04:23, Christopher Fynn wrote:

    CF> Why not use:
    CF> G + U+0338 COMBINING LONG SOLIDUS OVERLAY ...

    Until now, Latin letters have been encoded as indivisible entities,
    not only when they are overstruck but also when they carry any kind
    of "fixed" appendage that is not simply attached at the bottom of the
    letter (as ogonek or cedilla are).
    For overstruck letters, the Sencoten additions (U+023A ...
    U+023C etc.) and the more recent U+A75E/U+A75F are examples.
    The latter pair in particular (overstruck V) was added for a specific
    (medievalist) purpose and is not used in any current orthography.
    None of these characters has even a compatibility equivalence to a
    sequence containing U+0338.
    Requiring such a sequence for other letters would therefore be an
    inconsistency in Unicode.
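
    The claim about missing equivalences can be checked against the
    Unicode Character Database. The following is only a minimal sketch
    using Python's standard unicodedata module; the helper name
    decomposition_of and the choice of A WITH OGONEK and C WITH CEDILLA
    as contrast cases are my own assumptions for illustration.

```python
import unicodedata

def decomposition_of(ch):
    """Return the raw decomposition field from the character database ('' if none)."""
    return unicodedata.decomposition(ch)

# Overstruck letters named in the message: encoded as atomic characters.
overstruck = ["\u023A", "\u023B", "\u023C", "\uA75E", "\uA75F"]
# Contrast cases (chosen for illustration): appendage attached at the bottom.
bottom_attached = ["\u0104", "\u00C7"]  # A WITH OGONEK, C WITH CEDILLA

for ch in overstruck + bottom_attached:
    mapping = decomposition_of(ch) or "(none)"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {mapping}")
```

    The overstruck letters report no decomposition at all, neither
    canonical nor compatibility, while the ogonek and cedilla letters
    decompose into a base letter plus U+0328 or U+0327.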

    From an abstract point of view, it would have been possible to encode
    such letters a priori as a sequence of basic letter + overstriking
    diacritic, and that might even have been preferable, since Unicode has
    the necessary mechanisms, as many South Asian scripts show. This,
    however, would have required a U+0338 with semantics explicitly
    declared for that purpose.

    The situation is comparable to the Arabic encoding model, where a
    model based on ghost characters + combining marks could have been
    selected but in fact was not.

    Even Latin characters with simple appendages are encoded as
    indivisible entities without employing any compatibility equivalences.
    Examples are the Uighur additions U+2C67 ... U+2C6C (historical use,
    directly comparable to the slashed letters of my proposal) and the
    letters with palatal and retroflex hook U+1D80 ... U+1D9A (purely
    scientific use, not used in any orthography).
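
    The same kind of check can be run over the two ranges just
    mentioned. Again this is only a sketch with Python's unicodedata
    module; the helper has_decomposition and the range labels are
    hypothetical names of mine.

```python
import unicodedata

def has_decomposition(cp):
    """True if the code point has any canonical or compatibility decomposition."""
    return unicodedata.decomposition(chr(cp)) != ""

# Ranges as cited in the message; the labels are informal.
ranges = {
    "U+2C67..U+2C6C": range(0x2C67, 0x2C6D),
    "U+1D80..U+1D9A": range(0x1D80, 0x1D9B),
}

for label, cps in ranges.items():
    decomposable = [f"U+{cp:04X}" for cp in cps if has_decomposition(cp)]
    print(label, "letters with a decomposition:", decomposable or "none")
```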

    As said, such letters could have been constructed from combining
    elements if the encoding model for Latin had been designed that way,
    with building elements explicitly devised for that purpose, as was
    done for most South Asian scripts.
    Changing the Latin encoding model now would require, among other
    things, the introduction of a new equivalence (in addition to the
    canonical equivalence, which is now stabilized) to handle the existing
    letters. In any case, changing the Latin encoding model after the
    majority of the Latin letters have been encoded is not advisable.
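
    To make the point about a new equivalence concrete: none of the
    four existing normalization forms relates U+A75E to V + U+0338,
    whereas a character that does have a canonical decomposition with
    U+0338 (I picked U+2260 NOT EQUAL TO as the contrast case) is
    unified by all of them. The sketch below assumes Python's
    unicodedata module. The table HYPOTHETICAL_FOLD and the function
    fold are invented here purely for illustration and are not part of
    Unicode; they only show what kind of extra mapping would be needed.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def equivalent_under(a, b):
    """List the normalization forms under which a and b become identical."""
    return [f for f in FORMS
            if unicodedata.normalize(f, a) == unicodedata.normalize(f, b)]

atomic = "\uA75E"      # LATIN CAPITAL LETTER V WITH DIAGONAL STROKE
sequence = "V\u0338"   # V + COMBINING LONG SOLIDUS OVERLAY

print(equivalent_under(atomic, sequence))     # no form unifies them
print(equivalent_under("\u2260", "=\u0338"))  # canonical pair: all four forms

# Hypothetical extra equivalence, NOT defined anywhere in Unicode.
HYPOTHETICAL_FOLD = {"\uA75E": "V\u0338", "\uA75F": "v\u0338"}

def fold(text):
    """Apply the hypothetical folding character by character."""
    return "".join(HYPOTHETICAL_FOLD.get(ch, ch) for ch in text)

print(fold(atomic) == sequence)
```

    Any such folding would have to live outside the stabilized
    canonical and compatibility mappings, which is exactly the
    additional machinery described above.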

    Using existing characters as "Lego blocks" to "build" arbitrarily
    constructed letters, delegating the letter identities to specialized
    fonts or rendering systems, cannot be the purpose of a standard like
    Unicode.

    - Karl Pentzlin


