From: Asmus Freytag (firstname.lastname@example.org)
Date: Mon Dec 14 2009 - 19:53:10 CST
On 12/14/2009 1:35 PM, Michael Everson wrote:
> On 14 Dec 2009, at 20:56, Julian Bradfield wrote:
>> On 2009-12-14, Michael Everson <email@example.com> wrote:
>>> I agree. Canonical equivalence is identity.
>> That's a nonsensical statement. Well, actually it's not nonsensical,
>> it's just plain wrong.
>> Everybody who uses the word "identity" in a technical sense knows
>> what it means, and it doesn't mean "has different bytes".
> Evidently I was not using it in a technical sense.
>> What you presumably mean is "the space in which filenames live
>> *ought* to be the set of utf-8 strings quotiented by canonical
>> equivalence" (so that two canonically equivalent strings are
>> representatives of one and the same filename).
> No, that's not what I meant.
> I meant that é 00E9 and é 0065 0301 are the same platonic entity (acute e)
> in an intrinsic sense, whereas both are different from a Cyrillic
> lookalike, е́ 0435 0301.
> *That* kind of identity.
Which, formally, is an equivalence, hence Unicode's term "canonically
equivalent", which sets it apart from the myriad other ways in which
two code sequences could be considered equivalent by different user
communities.
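To make the three strings from the quoted message concrete, here is a
minimal sketch using Python's standard unicodedata module. The helper
name canonically_equivalent is my own, chosen for illustration; it
simply compares NFD forms, which is one common way to test canonical
equivalence.

```python
import unicodedata

precomposed = "\u00E9"        # é as the single code point U+00E9
decomposed = "\u0065\u0301"   # Latin e followed by combining acute accent
cyrillic = "\u0435\u0301"     # Cyrillic е followed by combining acute accent


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare the canonical decompositions (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(precomposed == decomposed)                         # False
print(canonically_equivalent(precomposed, decomposed))   # True
print(canonically_equivalent(precomposed, cyrillic))     # False
```

The plain == comparison sees different code point sequences; only the
normalized comparison reflects the equivalence Unicode defines, and the
Cyrillic lookalike stays distinct under both.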
What people lose sight of here, and what you were trying to address with
your "platonic entity", is the distinction between the character
(abstract character) and its encoding in actual data (code unit sequence).
Because of the UTFs, Unicode has at least three levels: the abstract
character, the numeric value (an integer between 0 and 1114111, that is,
0x10FFFF) and the bytes, words and double-words of the encoding form.
That two different sequences of code units refer to the same coded
character is usually taken in stride, because the mappings are lossless.
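As a rough illustration of those levels, the sketch below takes one
character, shows its numeric value, and then its code units in three
encoding forms. The variable names are mine; the big-endian codecs are
chosen only so that no byte order mark appears in the output.

```python
ch = "\u00E9"                      # the character é

code_point = ord(ch)               # numeric value of the coded character
utf8 = ch.encode("utf-8")          # sequence of bytes
utf16 = ch.encode("utf-16-be")     # one 16-bit word, big-endian
utf32 = ch.encode("utf-32-be")     # one 32-bit double-word, big-endian

print(hex(code_point))             # 0xe9
print(utf8.hex())                  # c3a9
print(utf16.hex())                 # 00e9
print(utf32.hex())                 # 000000e9

# The mappings are lossless: each form decodes back to the same string.
round_trips = [
    utf8.decode("utf-8"),
    utf16.decode("utf-16-be"),
    utf32.decode("utf-32-be"),
]
print(all(s == ch for s in round_trips))   # True
```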
That more than one code point sequence can refer to the same abstract
character is problematic, because it forces a choice when going from
abstract character to encoding.
But what is the correct level at which to let users draw distinctions
when naming objects on a file system? Logically, it is the abstract
character, even if for various reasons of engineering that has not
happened. Some systems go further, and apply other equivalences (case,
mostly), but at that moment you leave the abstraction level of the
encoding and enter the realm of convention.
From there it is a short step to "religious" wars.
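For what it's worth, the two kinds of comparison can be sketched as a
lookup key for names. This is only an illustration under my own
assumptions, not a description of how any actual file system behaves:
name_key and its fold_case parameter are hypothetical, and the case
handling uses Python's str.casefold as a stand-in for whatever
convention a given system adopts.

```python
import unicodedata


def name_key(name: str, fold_case: bool = False) -> str:
    """Hypothetical key under which equivalent names compare equal."""
    # Canonical equivalence: still within the abstraction of the encoding.
    key = unicodedata.normalize("NFD", name)
    if fold_case:
        # Case folding: a convention layered on top. Renormalize afterwards,
        # since folding can produce sequences that are not in NFD.
        key = unicodedata.normalize("NFD", key.casefold())
    return key


a = "R\u00E9sum\u00E9.txt"       # precomposed é
b = "Re\u0301sume\u0301.txt"     # e + combining acute
c = "r\u00E9sum\u00E9.TXT"       # differs from a only in case

print(name_key(a) == name_key(b))                                   # True
print(name_key(a) == name_key(c))                                   # False
print(name_key(a, fold_case=True) == name_key(c, fold_case=True))   # True
```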