Re: Representing Unix filenames in Unicode

From: Neil Harris (neil@tonal.clara.co.uk)
Date: Mon Nov 28 2005 - 13:49:02 CST

Next message: Marcin 'Qrczak' Kowalczyk: "Re: Representing Unix filenames in Unicode"

Previous message: Doug Ewell: "Re: Representing Unix filenames in Unicode"
In reply to: Hans Aberg: "Re: Representing Unix filenames in Unicode"
Next in thread: Hans Aberg: "Re: Representing Unix filenames in Unicode"
Reply: Hans Aberg: "Re: Representing Unix filenames in Unicode"

Hans Aberg wrote:
> On 28 Nov 2005, at 03:44, Doug Ewell wrote:
>
>> Whatever you guys decide, please let's not have any proposals to
>> "improve" UTF-8, or invent a mutant form of UTF-8, by giving it a way
>> to map these arbitrary byte sequences bijectively while
>> simultaneously retaining the existing properties of UTF-8. We had
>> that discussion a while back. The first one to suggest "fixing"
>> UTF-8 automatically loses.
>
> My guess is that it is simplest to store UTF-8 names as is as
> byte-strings on the low level, possibly with some information whether
> it is ASCII or UTF-8 (or possibly some encoding), which is important
> in UNIX. Then the problem arises what to do when low filenames appear
> which cannot be given UTF-8 interpretation. Letting the low level file
> handling having to bother with that seems to be a bad idea: it does
> not need that, and interpretations will just complicate and slow
> things down. So then the idea I presented is to simply encode this to
> consistent UTF-8 in a way that the original byte string can be converted
> back. A UNIX context may though need more than one invertible
> byte-string UTF-8 encoding, say if one is considering filenames,
> filepaths or filepath sequences. The question is truly tricky though.
> One must think through what will happen with all standard UNIX
> programs that interpret byte strings and character strings. So I
> would prefer to leave it to those UNIX experts to work it out.
>
> Hans Aberg
>
>
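For concreteness, here is a minimal sketch of the kind of reversible
mapping described above. It does not alter UTF-8 itself; it assumes
Python's "surrogateescape" error handler (PEP 383), which maps each
undecodable byte to a lone surrogate code point in U+DC80..U+DCFF so that
the original bytes can be recovered. The helper names to_text and to_bytes
are hypothetical, chosen only for this illustration.

```python
def to_text(raw: bytes) -> str:
    """Decode a filename, escaping bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Invert to_text: escaped surrogates become the original bytes again."""
    return text.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    samples = [
        b"README.txt",        # pure ASCII
        b"caf\xc3\xa9.txt",   # valid UTF-8
        b"caf\xe9.txt",       # ISO 8859-1 bytes, not valid UTF-8
    ]
    for raw in samples:
        text = to_text(raw)
        # The round trip restores the exact byte string in every case.
        print(raw, ascii(text), to_bytes(text) == raw)
```

The resulting strings are not well-formed Unicode text when they contain
escaped bytes, so they are only safe to hand back to the same decoding
layer, not to interchange.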
The set of ASCII strings is a proper subset of the set of UTF-8 strings,
so no information would need to be stored about which of those two
encodings was being used.
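
A quick way to see the subset property, as a small sketch: every byte
string made only of bytes below 0x80 is valid UTF-8 and decodes to the
same text under either encoding. The helper names is_ascii and
is_valid_utf8 are made up for this example.

```python
def is_ascii(name: bytes) -> bool:
    """True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in name)


def is_valid_utf8(name: bytes) -> bool:
    """True if the byte string decodes as strict UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for name in [b"README.txt", b"caf\xc3\xa9.txt", b"caf\xe9.txt"]:
        print(name, "ascii:", is_ascii(name), "utf-8:", is_valid_utf8(name))
```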

Now, ISO 8859-1, that's a different matter -- I suppose that, for
backwards compatibility, you could still rely on the property that
_almost all_ non-pure-ASCII ISO 8859-1 natural language strings are not
also valid UTF-8 strings, and ditto for most other fixed 8-bit encodings,
but I certainly wouldn't be willing to trust my filesystem to this sort
of hack.
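
To illustrate why the heuristic is only "almost" reliable, here is a
sketch of the try-UTF-8-then-fall-back approach, together with one case
where it guesses wrong. The function guess_decode is a hypothetical
helper, not something any particular filesystem or library provides.

```python
from typing import Tuple


def guess_decode(name: bytes) -> Tuple[str, str]:
    """Return (text, encoding guessed), preferring UTF-8 when it is valid."""
    try:
        return name.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Every byte string is decodable as ISO 8859-1, so this cannot fail.
        return name.decode("iso-8859-1"), "iso-8859-1"


if __name__ == "__main__":
    # Typical case: the ISO 8859-1 bytes are not valid UTF-8, so the
    # fallback recovers the intended name.
    typical = "caf\u00e9".encode("iso-8859-1")
    print(guess_decode(typical))

    # Misfire: the two ISO 8859-1 characters U+00C3 U+00A9 encode to the
    # bytes C3 A9, which are also the valid UTF-8 encoding of U+00E9.
    ambiguous = "\u00c3\u00a9".encode("iso-8859-1")
    print(guess_decode(ambiguous))
```

The second case is rare in natural-language names, but nothing in the
bytes themselves tells the two readings apart.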

-- Neil


This archive was generated by hypermail 2.1.5 : Mon Nov 28 2005 - 18:57:59 CST