Re: Medievalist ligature character in the PUA

From: Asmus Freytag
Date: Tue Dec 15 2009 - 15:30:16 CST

    On 12/15/2009 11:17 AM, Julian Bradfield wrote:
    > Asmus wrote:
    >> On 12/15/2009 2:31 AM, Julian Bradfield wrote:
    >>> On 2009-12-14, Michael Everson <> wrote:
    >>>> On 14 Dec 2009, at 20:56, Julian Bradfield wrote:
    > ...
    >>> As Asmus has pointed out, the question then is, do you ask users to
    >>> understand this, and magically know that two apparently different
    >>> strings are actually the same?
    >> This is where the disconnect is, and where you may be misquoting me. The
    >> typical user knows a writing system but not the code sequence.
    >> Programmers have tools that make code sequences visible to them, so they
    >> can distinguish them. Correctly formatted and displayed, ordinary users
    >> cannot tell the difference between alternative code sequences for the
    >> same abstract character. That is as it should be, because what is
    >> encoded is the abstract character.
    > Yes - but how many users can distinguish the different abstract
    > characters (Latin) o, (Greek) ο and (Cyrillic) о ? I certainly
    > can't.

    Most users have no problem with any of these, because, except in
    somewhat artificial test cases, they tend to be used in context with
    other Latin, Greek or Cyrillic letters, respectively. And, going beyond
    that, most users stick to one of these three scripts for the majority of
    their interaction with their computers. That does not mean that there
    aren't any real-world issues.

    > Is this inherently different from the distinction between
    > precomposed and combining characters?
    Yes, because the combining characters themselves are part and parcel of
    the methodology of mapping writing systems to binary encoded data. And
    precomposed characters are an artifact of the history of encoding
    characters. (There are also far more of these duplicates than of the
    other type.)
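
    A minimal sketch of the precomposed-versus-combining point, using only
    Python's standard unicodedata module. The letter "é" is an illustrative
    assumption, not an example taken from this thread; the helper name
    canonically_equal is likewise hypothetical.

```python
import unicodedata

# Two code sequences for the same abstract character (assumed example: "é").
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A naive code point comparison sees two different strings.
naive_equal = precomposed == combining

# After normalization the two sequences compare equal.
normalized_equal = canonically_equal(precomposed, combining)

# NFD goes the other way and decomposes the precomposed form.
decomposed = unicodedata.normalize("NFD", precomposed)

print(naive_equal, normalized_equal, decomposed == combining)
```

    Correctly rendered, both strings look the same to a reader; only a tool
    that exposes the code sequence shows the difference.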

    In contrast, the example you gave above is a result of the historical
    development of writing systems (outside the sphere of their digital
    encoding). On another (platonic?) level, the o and omicron once were
    identical, but they no longer are. (And strictly speaking, that only ever
    applied to their upper case forms.) There is no presumption, other
    than typographic, of their having the exact same representation. In
    fact, the Greek letter especially is often rendered in a noticeably
    different style, because many fonts show Greek in a different style from
    Latin. (View them with any font created for the JIS character set and
    they likely will immediately look distinct - that's what I did to verify
    that you didn't cheat).
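
    By contrast, the three look-alike letters are separate abstract
    characters, and normalization does not merge them. A small sketch, again
    assuming Python's standard unicodedata module (the variable names are
    illustrative only):

```python
import unicodedata

# Latin o, Greek omicron, Cyrillic o: visually similar, separately encoded.
lookalikes = ["\u006f", "\u03bf", "\u043e"]

# Each has its own character name in the Unicode Character Database.
names = [unicodedata.name(ch) for ch in lookalikes]
for ch, name in zip(lookalikes, names):
    print(f"U+{ord(ch):04X} {name}")

# Even compatibility normalization (NFKC) leaves all three distinct.
after_nfkc = {unicodedata.normalize("NFKC", ch) for ch in lookalikes}
print(len(after_nfkc))
```

    Telling these apart in software is therefore a different problem from
    the precomposed/combining case, where normalization settles the matter.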
    >> Unix users have inherited the mess created by the design approach that
    >> was based on "character set independence". That approach seemed a nice,
    >> value-neutral way to handle competing character sets, until it became
    >> clear that it would in many instances lead to the creation of
    >> effectively uninterpretable byte-streams. Hence Unicode. But all of that
    >> is, of course, history.
    > I wonder why we didn't settle on ISO 2022 encoded filenames before
    > Unicode came along? Just because of the overhead? Or just because of
    > the timeline of non-ASCII use of computers?
    Because of Unicode (and 10646). Absent these efforts to create a
    unifying character set, 2022 would have been the only choice - and as
    you note, the overhead would have been horrendous. Web access for small
    devices, anyone?
    >> How the encoding relates an abstract character to code sequence(s), on
    >> the other hand, is well defined in the Standard.
    > But the definition of abstract character doesn't necessarily match
    > what users think!
    And doesn't have to. As long as a given sequence of abstract characters
    is rendered and processed in a manner the users expect, the actual
    internal divisions are relatively irrelevant. If support for combining
    accents had been present and seamless from day one, one could argue that
    no one would have missed the precomposed characters.

