Re: Representing Unix filenames in Unicode

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Nov 27 2005 - 15:35:23 CST

    From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
    > "Philippe Verdy" <verdy_p@wanadoo.fr> writes:
    >> If you want to keep the compatibility with null-ended byte strings,
    >> may be the alternative using really non-character code points might
    >> help.
    >
    > What do you mean by "compatibility with null-ended byte streams"?
    >
    > The point in using U+0000 as the escape character is that it does not
    > appear when filenames are converted to Unicode using pure UTF-8. And
    > it's the only such code point (unless we count surrogates, but abusing
    > them would be worse).
    >
    > This means that any filename which can be decoded using pure UTF-8,
    > decodes to the same string using UTF-8-with-escaped-bytes. And any
    > string which can be encoded into a filename using pure UTF-8 at all
    > (i.e. consisting only of code points U+0001..U+D7FF or U+E000..U+10FFFF)
    > encodes to the same string using UTF-8-with-escaped-bytes.
    >
    >> Really, you cannot reach a full bijection for those cases:
    >
    > Actually it would be possible, but it's hard to design a bijection
    > with sensible properties like preserving concatenation and preserving
    > ASCII fragments.
    >
    > But I don't need a bijection: it's acceptable when there are Unicode
    > strings which can't be used as filenames. It's already the case in
    > pure UTF-8 (due to U+0000 and "/").

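A minimal sketch in C of the "UTF-8 with escaped bytes" decoding described in the quoted text. It assumes (this message does not spell it out) that each undecodable byte is represented as U+0000 followed by a code point equal to the byte value. The names decode_one and decode_escaped are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Try to decode one strictly valid UTF-8 sequence at s (n bytes remain).
   Returns its length (1..4) and stores the code point in *cp,
   or returns 0 if the bytes at s are not valid UTF-8. */
static size_t decode_one(const unsigned char *s, size_t n, uint32_t *cp)
{
    static const uint32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t len, i;
    uint32_t c;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return 0;                  /* stray continuation byte or 0xF8..0xFF */

    if (len > n) return 0;          /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_for_len[len]) return 0;        /* overlong form */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;  /* surrogate */
    if (c > 0x10FFFF) return 0;                /* beyond Unicode range */
    *cp = c;
    return len;
}

/* Decode a filename of n bytes into out (capacity cap code points).
   Valid UTF-8 decodes as usual; any other byte b becomes the pair
   U+0000, U+00bb (ASSUMED escape format, see the note above).
   Returns the number of code points written, or (size_t)-1 if out
   is too small. */
size_t decode_escaped(const unsigned char *name, size_t n,
                      uint32_t *out, size_t cap)
{
    size_t i = 0, k = 0;

    assert(name != NULL && out != NULL);
    while (i < n) {
        uint32_t cp;
        size_t len = decode_one(name + i, n - i, &cp);
        if (len == 0) {
            if (cap - k < 2) return (size_t)-1;
            out[k++] = 0;        /* escape marker U+0000 */
            out[k++] = name[i];  /* the raw byte, kept as a code point */
            i += 1;
        } else {
            if (k >= cap) return (size_t)-1;
            out[k++] = cp;
            i += len;
        }
    }
    return k;
}
```

Since a Unix filename never contains a nul byte, U+0000 can only appear in the output as an escape marker, which is why a name that is already valid UTF-8 decodes to the same string as with a pure UTF-8 decoder.
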
    Note that the following definition of dirent is already compatible with
    extensions that would allow storing metadata after the filename in the same
    dirent entry:

    struct dirent {
        ino_t d_ino;             /* inode number in the volume */
        off_t d_off;             /* byte offset of the next non-empty directory
                                    entry in the actual file system directory
                                    (can be used for seeking into the directory
                                    at absolute positions) */
        unsigned short d_reclen; /* total length of THIS record, including
                                    padding bytes */
        char d_name[1];          /* variable-length name plus nul byte; the
                                    maximum length including the nul is
                                    MAXNAMLEN, not counting any alignment
                                    padding bytes */
    };

    Note how d_reclen already includes the required terminating nul byte and the
    padding bytes. Nothing forbids the filesystem from including more "padding"
    bytes and using them to store metadata, such as an indicator of the encoding
    with which the filename was created.
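
As an illustration of how much room d_reclen leaves after the name, here is a small C sketch. It uses a hypothetical struct demo_dirent that mirrors the layout above (with a C99 flexible array member instead of the older d_name[1] idiom) rather than any real system's struct dirent; spare_bytes and make_record are made-up helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record mirroring the layout quoted in the message.
   Real systems declare struct dirent in <dirent.h> with varying fields. */
struct demo_dirent {
    unsigned long  d_ino;
    long           d_off;
    unsigned short d_reclen;  /* total record length, padding included */
    char           d_name[];  /* name, nul byte, then padding */
};

/* Number of bytes in the record after the name's terminating nul.
   These are the "padding" bytes that could carry metadata. */
size_t spare_bytes(const struct demo_dirent *de)
{
    size_t used;

    assert(de != NULL);
    used = offsetof(struct demo_dirent, d_name) + strlen(de->d_name) + 1;
    assert(de->d_reclen >= used);
    return de->d_reclen - used;
}

/* Build a zero-filled record of total length reclen holding name.
   Returns NULL if reclen cannot hold the name; the caller frees it. */
struct demo_dirent *make_record(const char *name, size_t reclen)
{
    size_t need = offsetof(struct demo_dirent, d_name) + strlen(name) + 1;
    struct demo_dirent *de;

    if (reclen < need || reclen > 0xFFFF) return NULL;
    de = calloc(1, reclen < sizeof *de ? sizeof *de : reclen);
    if (de == NULL) return NULL;
    de->d_reclen = (unsigned short)reclen;
    memcpy(de->d_name, name, strlen(name));  /* nul and padding stay zero */
    return de;
}
```

For a record whose length is the header size plus 8 bytes and whose name is "abc", spare_bytes reports 4: eight bytes minus three name bytes minus the nul.
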

    Note also that the dirent structure is not the physical one used in UFS (in
    UFS the "d_off" field is not stored, and the other fields may be ordered
    differently.) The application is not exposed to the physical format of
    directory entries.

    The OS only provides "d_off" as a way to allow seeking to an absolute
    position in the directory file, but it states nothing about how d_off is
    correlated with d_reclen. So d_off may be a simple counter incremented by 1
    for each directory entry, or it may be a block number, where each dirent
    structure is allocated on the filesystem as an exact multiple of a block
    size that is not exposed here. Nor does anything in this structure indicate
    which encoding is actually used in the underlying filesystem.
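
Because the position value is opaque, portable code treats it only as a cookie to hand back to the OS. A minimal sketch using the POSIX (XSI) calls telldir() and seekdir(), which expose the same kind of position; seek_roundtrip is a hypothetical helper name, and the sketch assumes the directory is not modified while it runs.

```c
#define _XOPEN_SOURCE 700  /* telldir/seekdir are XSI extensions */
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Remember a directory position, read one entry, seek back and read again.
   Returns 1 if the same entry comes back, 0 if not, and -1 if the
   directory cannot be opened or has fewer than two entries. */
int seek_roundtrip(const char *path)
{
    DIR *dir;
    struct dirent *de;
    char saved[1024];
    long pos;
    int same;

    assert(path != NULL);
    dir = opendir(path);
    if (dir == NULL) return -1;

    if (readdir(dir) == NULL) { closedir(dir); return -1; }
    pos = telldir(dir);  /* opaque cookie: not necessarily a byte count */

    de = readdir(dir);
    if (de == NULL) { closedir(dir); return -1; }
    snprintf(saved, sizeof saved, "%s", de->d_name);

    seekdir(dir, pos);   /* go back to the remembered position */
    de = readdir(dir);
    same = (de != NULL && strcmp(de->d_name, saved) == 0);

    closedir(dir);
    return same;
}
```

The code never does arithmetic on pos or compares it with d_reclen, which is the point made above: only the OS knows what the value means.
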

    So how can this structure be used in applications? Simple: d_reclen is the
    total size of the filesystem-independent record, including the "d_ino",
    "d_off", and "d_reclen" fields, and up to MAXNAMLEN bytes for the
    nul-terminated name in d_name. But d_name can be longer than MAXNAMLEN on
    the actual filesystem. The above structure is never directly used
    physically. One can include another field in it to store metadata (such as
    the encoding of the name in d_name...)

    On older kernels, this meta-data field could be a single additional byte
    stored in the padding area after the first nul byte in d_name, with a magic
    value: for example 08 for UTF-8, given that padding bytes after that first
    nul should all be zeroes.
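
A rough C sketch of that magic-byte convention, working on the d_name area of a record (its length would be d_reclen minus the offset of d_name). The value 0x08 for UTF-8 is the example given above; the names TAG_NONE, TAG_UTF8, find_tag_slot, get_encoding_tag and set_encoding_tag are hypothetical, and no existing filesystem or libc is claimed to behave this way.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TAG_NONE 0x00  /* plain zero padding: encoding unknown */
#define TAG_UTF8 0x08  /* example magic value from the message */

/* Locate the first padding byte after the name's nul terminator.
   Returns NULL if the name is unterminated or there is no padding. */
static const char *find_tag_slot(const char *area, size_t len)
{
    const char *nul = memchr(area, '\0', len);

    if (nul == NULL) return NULL;
    if ((size_t)(nul - area) + 1 >= len) return NULL;
    return nul + 1;
}

/* Read the encoding tag; TAG_NONE when there is no room for one. */
int get_encoding_tag(const char *area, size_t len)
{
    const char *slot;

    assert(area != NULL);
    slot = find_tag_slot(area, len);
    return slot != NULL ? (unsigned char)*slot : TAG_NONE;
}

/* Store an encoding tag in the first padding byte.
   Returns 0 on success, -1 if the record has no padding byte. */
int set_encoding_tag(char *area, size_t len, unsigned char tag)
{
    char *slot;

    assert(area != NULL);
    slot = (char *)find_tag_slot(area, len);
    if (slot == NULL) return -1;
    *slot = (char)tag;
    return 0;
}
```

Code that stops at the first nul byte still sees the same name, so writing the tag does not disturb applications that are unaware of it.
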

    Really, Unix filesystems can be fixed, and I don't see why this is not done
    so that applications become aware of that feature (for example, GLIBC could
    interpret the presence of the magic byte above to know how to convert the
    dirent entry unambiguously to the encoding currently set in the user's
    POSIX locale, and applications can/should use POSIX functions to create
    files under those conventions).


