Re: Medievalist ligature character in the PUA

From: verdy_p (verdy_p@wanadoo.fr)
Date: Tue Dec 15 2009 - 05:51:47 CST


    > Message of 15/12/09 12:11
    > From: "Julian Bradfield"
    > To: unicode@unicode.org
    > Cc:
    > Subject: Re: Medievalist ligature character in the PUA
    >
    >
    > On 2009-12-14, Michael Everson wrote:
    > > On 14 Dec 2009, at 20:56, Julian Bradfield wrote:
    > >>[...]
    > > Evidently I was not using [identify] in a technical sense.
    >
    > The technical sense is also the normal English sense. Things are
    > "identical" if they're exactly the same.
    >
    > >> What you presumably mean is "the space in which filenames live
    > >> *ought* to be the set of utf-8 strings quotiented by canonical
    > >> equivalence" (so that two canonically equivalent strings are
    > >> representatives of one and the same filename).
    > >
    > > No, that's not what I meant.
    > >
    > > I meant that é 00E9 and é 0065 0301 the same platonic entity (acute
    > > e) in an intrinsic sense, whereas both are different from a Cyrillic
    > > lookalike, е́ 0435 0301.
    > >
    > > *That* kind of identity.
    >
    > How does what you said differ from what I said, except that I said it
    > precisely? Your "platonic entity" is my "equivalence
    > class of UTF-8 strings under canonical equivalence". That defines an
    > identity on the "platonic entities", NOT on the UTF-8 strings.

    UTF-8 does not need to be specified in your definition of equivalence classes. Note that UTF-8 is strictly a
    conforming transform, which already puts UTF-8 strings in one-to-one correspondence with the sequences of code
    points they represent (but only code points that have a scalar value, so excluding the range of surrogates), but
    NOT with sequences of valid characters.

    Note then that a valid UTF-8 string is not necessarily a conforming Unicode text (because a UTF-8 string MAY contain
    some code points that DO have a scalar value but that are NOT characters, such as U+FFFF).
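
    The distinction above can be checked with a minimal sketch, assuming Python 3 and its built-in "utf-8" codec. The
    helper names (is_noncharacter, utf8_round_trips, contains_noncharacter) are hypothetical and chosen only for
    this illustration.

```python
def is_noncharacter(cp: int) -> bool:
    """True for noncharacters: U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def utf8_round_trips(cp: int) -> bool:
    """True if the code point can be encoded to UTF-8 and decoded back unchanged."""
    try:
        return chr(cp).encode("utf-8").decode("utf-8") == chr(cp)
    except UnicodeError:
        # Python refuses to encode surrogates (no scalar value) in strict UTF-8.
        return False


def contains_noncharacter(text: str) -> bool:
    """True if any code point of the text is a noncharacter."""
    return any(is_noncharacter(ord(ch)) for ch in text)


# U+FFFF has a scalar value, so it is representable in UTF-8,
# yet it is a noncharacter and not an encoded character.
print(utf8_round_trips(0xFFFF), is_noncharacter(0xFFFF))
# U+D800 is a surrogate: no scalar value, not representable in UTF-8.
print(utf8_round_trips(0xD800))
```

    With such a check, "well-formed UTF-8" and "contains only code points that are not noncharacters" are two separate
    tests, which is the point made above.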

    As this discussion is about characters that have an identity in Unicode, it should exclude noncharacters (even if
    they MAY exist in UTF-8 strings): noncharacters do not have an identity and are not encoded as characters (they
    are also excluded from normalized forms, so they cannot be part of any string in a Unicode canonical equivalence
    class).

    So the reply by Michael Everson is more precise than yours.

    If you prefer, you can however speak about "equivalence classes of valid Unicode text under canonical equivalence":
    these will be valid classes covering only the meaningful part of valid UTF-8 strings, but also the bijectively
    equivalent classes within UTF-32 strings, UTF-16 strings or BOCU-1 strings (but *not* SCSU strings, as SCSU does
    not provide a single encoding for the same text, something that makes SCSU a very bad encoding for filenames
    stored in a filesystem), and you'll be independent of the actual encoding used (provided that it is a conforming
    Unicode transform scheme).
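
    As an illustration of classes that do not depend on the encoding, here is a minimal sketch assuming Python 3 and
    its standard unicodedata module. The function names (canonical_key, same_class, same_class_bytes) are
    hypothetical, and NFD is used here only as one convenient representative of each class.

```python
import unicodedata


def canonical_key(text: str) -> str:
    """Pick one representative (the NFD form) of the canonical equivalence class."""
    return unicodedata.normalize("NFD", text)


def same_class(a: str, b: str) -> bool:
    """True if two texts are canonically equivalent."""
    return canonical_key(a) == canonical_key(b)


def same_class_bytes(data_a: bytes, codec_a: str, data_b: bytes, codec_b: str) -> bool:
    """Compare two byte strings that may use different encodings of Unicode text."""
    return same_class(data_a.decode(codec_a), data_b.decode(codec_b))


precomposed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"      # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
cyrillic = "\u0435\u0301"   # CYRILLIC SMALL LETTER IE + COMBINING ACUTE ACCENT

# Same class, even though one side is UTF-8 bytes and the other UTF-16 bytes.
print(same_class_bytes(precomposed.encode("utf-8"), "utf-8",
                       decomposed.encode("utf-16-le"), "utf-16-le"))
# The Cyrillic lookalike belongs to a different class.
print(same_class(decomposed, cyrillic))
```

    The comparison is made on decoded text, so the byte encoding only matters at the boundary where the bytes are
    decoded.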

    ----
    Note that BOCU-1 (despite its unique encoding and the fact that it is a conforming Unicode transform) would also be
    a bad choice for filenames in a filesystem, because it does not preserve the "/" character or the special names
    like "." or ".." used in most hierarchical filesystems (and in FTP and HTTP URLs).

    But a variant based on the more general BOCU algorithm could be used for filenames in filesystems, if it preserved
    some other characters like ".", "/" and "\", as well as characters generally used in shells for wildcards like "+",
    "*", "?", "[", "]", or in shell syntax like "(", ")", single and double quotation marks, and some other punctuation
    like ",", ";" and braces. (The general algorithm is patented by IBM and not usable without requesting a licence
    from IBM; the only exception is BOCU-1, for which there is a free licence as long as the implementation conforms
    *strictly* to its specification, which disallows all extensions or variants.) As far as I know, no BOCU-based
    encoding is used in any filesystem (but maybe one exists in IBM's AIX?).
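
    To make the "preserves the / character" requirement concrete, here is a rough sketch assuming Python 3. It does
    not implement BOCU-1; it only contrasts two encodings already mentioned in this thread, UTF-8 and UTF-16, using the
    standard codecs. The function name separator_is_preserved is hypothetical, and the byte count it uses is only a
    necessary condition, not a proof of safety.

```python
def separator_is_preserved(text: str, codec: str, sep: str = "/") -> bool:
    """Heuristic: the separator byte must occur in the encoded form exactly
    as often as the separator character occurs in the text."""
    encoded = text.encode(codec)
    return encoded.count(sep.encode("ascii")) == text.count(sep)


# U+012F (LATIN SMALL LETTER I WITH OGONEK) is 2F 01 in UTF-16LE,
# so a byte-oriented filesystem would see a spurious "/" byte.
name = "dir/\u012f"

print(separator_is_preserved(name, "utf-8"))      # UTF-8 keeps ASCII bytes for ASCII only
print(separator_is_preserved(name, "utf-16-le"))  # the 0x2F byte appears twice here
```

    Any encoding considered for filenames would have to pass this kind of test for every reserved character, not just
    for "/".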
    

