Re: Medievalist ligature character in the PUA

From: Jukka K. Korpela
Date: Mon Dec 14 2009 - 14:29:19 CST

    Michael Everson wrote:

    > On 14 Dec 2009, at 18:55, Peter Edberg wrote:
    >>>> And should an OS treat "My file" and "My file" as the same file
    >>>> name?
    >>> This problem is with us already (on Apple systems, of all things).
    >>> MacOS X decomposes Cyrillic Й and Ё in file names and treats
    >>> файл and файл as the same file name
    >> Which seems appropriate, since they are canonically equivalent.
    > I agree. Canonical equivalence is identity.

    First, й (U+0439) and й (U+0438 U+0306) are indeed canonically equivalent:
    the character й (U+0439) has the canonical decomposition U+0438 U+0306.
    That is how things are defined in Unicode.

    Second, canonical equivalence is not identity. For example, é (U+00E9) and
    é (U+0065 U+0301) are not identical: the first one is one code point, the
    second one is two code points. (Some programs, maybe even the one I’m using
    now, might silently convert U+0065 U+0301 to U+00E9. This by no means proves
    they’re identical, any more than other silent conversions make e.g.
    hyphen-minus identical to en dash.)
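
    The é example can be checked with a minimal sketch, assuming Python and its
    standard unicodedata module (the variable names are only illustrative). It
    compares the two sequences as they are and again after normalization:

```python
import unicodedata

precomposed = "\u00e9"   # é as the single code point U+00E9
decomposed = "e\u0301"   # e (U+0065) followed by COMBINING ACUTE ACCENT (U+0301)

# As sequences of code points the two strings are different.
same_as_strings = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))

# Normalizing both to the same form makes them compare equal,
# which is what canonical equivalence amounts to.
same_in_nfc = (unicodedata.normalize("NFC", precomposed)
               == unicodedata.normalize("NFC", decomposed))
same_in_nfd = (unicodedata.normalize("NFD", precomposed)
               == unicodedata.normalize("NFD", decomposed))

print(same_as_strings, lengths, same_in_nfc, same_in_nfd)
```

    The plain comparison sees two different strings; only the comparison that
    normalizes first treats them alike.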

    (The letter Ё is comparable to the é case: it has a canonical
    decomposition, but it is still distinct from that decomposition.)

    Canonical equivalence is a relation between sequences of code points.
    Programs may ignore the distinction between canonically equivalent
    sequences, but they may also make any distinction they like between them,
    and they may even recognize just one of several canonically equivalent
    sequences. This is not uncommon in older software, which may support e.g. é
    as a precomposed character but not even recognize the combining acute
    accent.
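
    One way to picture the older-software case is a sketch that uses ISO 8859-1
    (Latin-1) as a stand-in for a legacy repertoire. That choice and the helper
    name encodable are assumptions made for illustration, not something from
    the discussion above:

```python
def encodable(text: str, encoding: str = "latin-1") -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# Latin-1 contains the precomposed é (U+00E9) ...
precomposed_ok = encodable("\u00e9")
# ... but has no combining acute accent (U+0301), so the canonically
# equivalent sequence cannot be represented at all.
decomposed_ok = encodable("e\u0301")

print(precomposed_ok, decomposed_ok)
```

    A program limited to such a repertoire handles one member of the
    equivalence class and rejects the other.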

    Thus, although файл and файл are definitely different strings, programs may
    and often do treat them as equivalent or, you might say, ”identical” for
    some definition of ”identity”—but then it’s a definition external to
    Unicode. Similarly, a file system might treat, say, ”My file” and ”Myfile”
    and ”MYFILE” and ”My%20file” all as ”identical” in the sense of naming the
    same file, even though they are of course different as strings.
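
    Such an external definition of "same name" might be sketched as follows.
    The functions name_key and same_file are hypothetical, and the rule chosen
    here (canonical decomposition plus case folding) is only one possible
    policy, not a description of any actual file system:

```python
import unicodedata

def name_key(name: str) -> str:
    """Hypothetical comparison key: decompose canonically, then fold case."""
    decomposed = unicodedata.normalize("NFD", name)
    # Case folding can produce unnormalized text, so normalize once more.
    return unicodedata.normalize("NFD", decomposed.casefold())

def same_file(a: str, b: str) -> bool:
    """True if the two names would collide under the policy above."""
    return name_key(a) == name_key(b)

composed = "caf\u00e9.txt"      # é as U+00E9
decomposed = "cafe\u0301.txt"   # e followed by U+0301

print(composed == decomposed)              # compared as strings
print(same_file(composed, decomposed))     # compared under the policy
print(same_file("My file", "MY FILE"))     # differs only in case
print(same_file("My file", "Myfile"))      # this policy keeps the space
```

    The strings stay different as strings; whether two names collide depends
    entirely on the policy, which could just as well ignore spaces or decode
    %20.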

    > So long as fonts display
    > the pre-composed glyph there should be no problem.

    It’s mostly confusing to consider display issues here. Besides, you surely
    know that fonts don’t do such things—rendering software might decide to
    render a character sequence as a ligature, but that’s a different issue.
