From: Asmus Freytag (email@example.com)
Date: Tue Dec 15 2009 - 10:43:29 CST
On 12/15/2009 2:31 AM, Julian Bradfield wrote:
> On 2009-12-14, Michael Everson <firstname.lastname@example.org> wrote:
>> On 14 Dec 2009, at 20:56, Julian Bradfield wrote:
>> Evidently I was not using [identify] in a technical sense.
> The technical sense is also the normal English sense. Things are
> "identical" if they're exactly the same.
The analogy here is a bit different - depending on your view. Michael
would maintain that the "things" are the (abstract) characters and not
the code unit sequence that you happen to use to describe them. In both
the technical and the normal English sense, one and the same thing may
have more than one description.
>>> What you presumably mean is "the space in which filenames live
>>> *ought* to be the set of utf-8 strings quotiented by canonical
>>> equivalence" (so that two canonically equivalent strings are
>>> representatives of one and the same filename).
>> No, that's not what I meant.
>> I meant that é 00E9 and é 0065 0301 are the same platonic entity (acute
>> e) in an intrinsic sense, whereas both are different from a Cyrillic
>> lookalike, е́ 0435 0301.
>> *That* kind of identity.
> How does what you said differ from what I said, except that I said it
> precisely? Your "platonic entity" is my "equivalence
> class of UTF-8 strings under canonical equivalence". That defines an
> identity on the "platonic entities", NOT on the UTF-8 strings.
Correct, you are both saying the same thing here - but...
> As Asmus has pointed out, the question then is, do you ask users to
> understand this, and magically know that two apparently different
> strings are actually the same?
This is where the disconnect is, and where you may be misquoting me. The
typical user knows a writing system but not the code sequence.
Programmers have tools that make code sequences visible to them, so they
can distinguish them. When text is correctly formatted and displayed,
ordinary users cannot tell the difference between alternative code
sequences for the same abstract character. That is as it should be,
because what is encoded is the abstract character.
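
To make the point concrete, here is a minimal sketch using the code points
from Michael's example. Python and its standard unicodedata module are my
choice for illustration only, and the helper name canonically_equivalent is
hypothetical. The two sequences for acute e differ as code sequences but
agree once normalized, while the Cyrillic lookalike stays distinct:

```python
import unicodedata

precomposed = "\u00e9"      # e-acute as one code point, U+00E9
decomposed = "e\u0301"      # U+0065 followed by combining acute U+0301
cyrillic = "\u0435\u0301"   # Cyrillic U+0435 followed by U+0301


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings up to canonical equivalence (hypothetical helper).

    Both sides are brought to NFD, so alternative code sequences for the
    same abstract character compare equal.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# What a programmer's tools see: two different code sequences.
print(precomposed == decomposed)
# What is actually encoded: one abstract character.
print(canonically_equivalent(precomposed, decomposed))
# The lookalike is a different abstract character and stays different.
print(canonically_equivalent(precomposed, cyrillic))
```

Which normalization form is used for the comparison does not matter here, as
long as both sides get the same canonical form (NFC or NFD).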
What systems designers have done in some cases is to force users to act
like programmers (in some cases because implementations were using
Unicode before normalization was settled).
Unix users have inherited the mess created by the design approach that
was based on "character set independence". That approach seemed a nice,
value-neutral way to handle competing character sets, until it became
clear that it would in many instances lead to the creation of
effectively uninterpretable byte-streams. Hence Unicode. But all of that
is, of course, history.
> If they're Windows users, they're used to this, because of the mess
> with case of filenames in FAT, but if they're Unix users, they're not
> at all used to it.
> On the other hand, the complexities of dealing with Unicode
> equivalence are a whole different league from dealing with simple case
Precisely. The question of case equivalence or not is on a different
level. Here you have a visible distinction, and it is a matter of
convention whether "FILE", "File", and "file" represent the same label or
three different ones. Conventions are arbitrary and disagreements about
them are common.
How the encoding relates an abstract character to code sequence(s), on
the other hand, is well defined in the Standard.
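
A small sketch of the contrast, again in Python as an assumption of mine and
not something any file system is claimed to do: canonical normalization
leaves the three labels apart, and only an extra convention (case folding
here) merges them.

```python
import unicodedata

labels = ["FILE", "File", "file"]

# Canonical normalization alone: the three labels remain distinct,
# because they are visibly different character sequences.
normalized = {unicodedata.normalize("NFC", s) for s in labels}

# Adding a case-folding convention on top collapses them into one label.
folded = {unicodedata.normalize("NFC", s).casefold() for s in labels}

print(len(normalized), len(folded))
```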
> I don't know what the right answer is - except to agree that it ought
> to be possible for a file system to be marked as only allowing UTF-8
> filenames, in some normalized form.
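
One way such a marking could be checked, purely as a sketch: the function
name is_acceptable_filename and the default of NFC are assumptions for
illustration, not the behavior of any existing file system. The check
rejects byte strings that are not valid UTF-8 and names that are not already
in the chosen normalization form.

```python
import unicodedata


def is_acceptable_filename(raw: bytes, form: str = "NFC") -> bool:
    """Hypothetical check: valid UTF-8 and already in the given form."""
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not interpretable as UTF-8 at all.
        return False
    # A name is accepted only if normalizing it changes nothing.
    return unicodedata.normalize(form, text) == text


print(is_acceptable_filename("\u00e9.txt".encode("utf-8")))   # precomposed
print(is_acceptable_filename("e\u0301.txt".encode("utf-8")))  # decomposed
print(is_acceptable_filename(b"\xe9.txt"))                    # Latin-1 bytes
```

A real implementation would also have to decide whether to reject
non-normalized names, as here, or to normalize them silently on creation and
lookup.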