Re: Medievalist ligature character in the PUA

From: Asmus Freytag
Date: Tue Dec 15 2009 - 10:43:29 CST

    On 12/15/2009 2:31 AM, Julian Bradfield wrote:
    > On 2009-12-14, Michael Everson wrote:
    >> On 14 Dec 2009, at 20:56, Julian Bradfield wrote:
    >>> [...]
    >> Evidently I was not using [identify] in a technical sense.
    > The technical sense is also the normal English sense. Things are
    > "identical" if they're exactly the same.
    The analogy here is a bit different - depending on your view. Michael
    would maintain that the "things" are the (abstract) characters and not
    the code unit sequence that you happen to use to describe them. In the
    technical as well as in the normal English sense, one and the same
    thing may have more than one description.
    >>> What you presumably mean is "the space in which filenames live
    >>> *ought* to be the set of utf-8 strings quotiented by canonical
    >>> equivalence" (so that two canonically equivalent strings are
    >>> representatives of one and the same filename).
    >> No, that's not what I meant.
    >> I meant that é 00E9 and é 0065 0301 are the same platonic entity (acute
    >> e) in an intrinsic sense, whereas both are different from a Cyrillic
    >> lookalike, е́ 0435 0301.
    >> *That* kind of identity.
    > How does what you said differ from what I said, except that I said it
    > precisely? Your "platonic entity" is my "equivalence
    > class of UTF-8 strings under canonical equivalence". That defines an
    > identity on the "platonic entities", NOT on the UTF-8 strings.
    Correct, you are both saying the same thing here - but...
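
    The identity being discussed can be illustrated with a minimal sketch,
    assuming Python's standard unicodedata module. The helper name
    canonically_equivalent is hypothetical, chosen only for this example:
    two strings count as the same "platonic entity" when their canonical
    decompositions (NFD) match, even though their code point sequences and
    UTF-8 bytes differ.

```python
import unicodedata

precomposed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (00E9)
decomposed = "e\u0301"      # LATIN SMALL LETTER E + COMBINING ACUTE (0065 0301)
cyrillic = "\u0435\u0301"   # CYRILLIC SMALL LETTER IE + COMBINING ACUTE (0435 0301)


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings under canonical equivalence.

    Both strings are brought to NFD before comparison, so alternative
    code sequences for the same abstract character compare equal.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The raw strings (and their UTF-8 encodings) differ ...
same_code_points = precomposed == decomposed
same_bytes = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# ... but the two Latin forms are canonically equivalent,
# while the Cyrillic lookalike is not equivalent to either.
latin_equivalent = canonically_equivalent(precomposed, decomposed)
cyrillic_equivalent = canonically_equivalent(precomposed, cyrillic)
```

    The equivalence relation is defined on the strings; the "entity" is the
    class of all strings that normalize to the same result.
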
    > As Asmus has pointed out, the question then is, do you ask users to
    > understand this, and magically know that two apparently different
    > strings are actually the same?
    This is where the disconnect is, and where you may be misquoting me. The
    typical user knows a writing system but not the code sequence.
    Programmers have tools that make code sequences visible to them, so they
    can distinguish them. When text is correctly formatted and displayed,
    ordinary users cannot tell the difference between alternative code
    sequences for the same abstract character. That is as it should be,
    because what is encoded is the abstract character.

    What systems designers have done in some cases is to force users to act
    like programmers (in some cases because implementations were using
    Unicode before normalization was settled).

    Unix users have inherited the mess created by the design approach that
    was based on "character set independence". That approach seemed a nice,
    value-neutral way to handle competing character sets, until it became
    clear that it would in many instances lead to the creation of
    effectively uninterpretable byte-streams. Hence Unicode. But all of that
    is, of course, history.
    > If they're Windows users, they're used to this, because of the mess
    > with case of filenames in FAT, but if they're Unix users, they're not
    > at all used to it.
    > On the other hand, the complexities of dealing with Unicode
    > equivalence are a whole different league from dealing with simple case
    > collapsing.
    Precisely. The question of case equivalence or not is on a different
    level. Here you have a visible distinction, and it is a matter of
    convention whether "FILE", "File", and "file" represent the same label or
    three different ones. Conventions are arbitrary and disagreements about
    them are common.

    How the encoding relates an abstract character to code sequence(s), on
    the other hand, is well defined in the Standard.
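
    The difference between the two levels can be sketched as follows,
    assuming Python's unicodedata module and str.casefold(). The function
    names are hypothetical. Normalization alone keeps the three labels
    distinct, because they are different abstract character sequences;
    only an additional, conventional case-folding step merges them.

```python
import unicodedata

labels = ["FILE", "File", "file"]


def distinct_under_normalization(names):
    """Count the labels that remain distinct after NFC normalization only."""
    return len({unicodedata.normalize("NFC", n) for n in names})


def distinct_under_case_folding(names):
    """Count the labels that remain distinct when a case-insensitive
    convention (here: Unicode case folding) is applied on top of NFC."""
    return len({unicodedata.normalize("NFC", n).casefold() for n in names})


normalized_count = distinct_under_normalization(labels)   # three separate labels
folded_count = distinct_under_case_folding(labels)        # one label by convention
```

    Whether to apply the second function is a policy decision for the
    system; the first reflects only what the encoding itself defines.
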
    > I don't know what the right answer is - except to agree that it ought
    > to be possible for a file system to be marked as only allowing UTF-8
    > filenames, in some normalized form.
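
    One way such a restriction might be checked is sketched below. This is
    an illustration only, not any real file system's behavior: the function
    name is_acceptable_filename and the default choice of NFC are
    assumptions made for the example.

```python
import unicodedata


def is_acceptable_filename(raw: bytes, form: str = "NFC") -> bool:
    """Hypothetical check: accept a name only if it is well-formed UTF-8
    and already in the chosen normalization form (NFC is an assumption)."""
    try:
        name = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 at all: an uninterpretable byte stream.
        return False
    # A string is in the given form exactly when normalizing it is a no-op.
    return unicodedata.normalize(form, name) == name


precomposed_ok = is_acceptable_filename("\u00e9.txt".encode("utf-8"))
decomposed_ok = is_acceptable_filename("e\u0301.txt".encode("utf-8"))
invalid_ok = is_acceptable_filename(b"\xff.txt")
```

    A file system taking this approach could either reject names that fail
    the check or normalize them on creation; the sketch only detects them.
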
