Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: Alastair Houghton via Unicode <unicode_at_unicode.org>
Date: Tue, 16 May 2017 16:30:09 +0100

On 16 May 2017, at 14:23, Hans Åberg via Unicode <unicode_at_unicode.org> wrote:
>
> You don't. You have a filename, which is an octet sequence of unknown encoding, and want to deal with it. Therefore, valid Unicode transformations of the filename may result in it not being reachable.
>
> It only matters that the correct octet sequence is handed back to the filesystem. All current filesystems, as far as experts could recall, use octet sequences at the lowest level; whatever encoding is used is built in a layer above.

HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. FAT 8.3 names are also encoded, but the encoding isn’t specified: MS-DOS and Windows assume an encoding based on your locale, which could cause all kinds of fun if you swapped disks with someone from a different country, and IIRC there are some shenanigans for Japan because 0xe5 is used as a deleted file marker. Some less widely used filesystems also require a particular encoding (BeOS’ BFS used UTF-8, for instance).
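
To make the 0xe5 point concrete, here is a rough sketch of the workaround as I understand it: a first byte of 0xe5 marks a deleted directory entry, but in Shift-JIS 0xe5 can also be the first byte of a two-byte character, so a name that genuinely starts with 0xe5 is stored with 0x05 in that position instead. The function and constant names are made up for illustration and do not come from any real FAT driver.

```python
# Sketch only: names here are hypothetical, not taken from a real FAT driver.
DELETED_MARKER = 0xE5   # first byte of a deleted directory entry
LEAD_SUBSTITUTE = 0x05  # stored in place of a genuine leading 0xE5


def to_stored_name(name: bytes) -> bytes:
    """Return the on-disk form of an 8.3 name's raw bytes."""
    if name[:1] == bytes([DELETED_MARKER]):
        return bytes([LEAD_SUBSTITUTE]) + name[1:]
    return name


def from_stored_name(stored: bytes):
    """Return the real name bytes, or None if the entry is marked deleted."""
    if stored[:1] == bytes([DELETED_MARKER]):
        return None
    if stored[:1] == bytes([LEAD_SUBSTITUTE]):
        return bytes([DELETED_MARKER]) + stored[1:]
    return stored
```

Only the first byte is special; a 0xe5 anywhere else in the name is stored unchanged.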

Also, Mac OS X and iOS use UTF-8 at the BSD layer; if a filesystem is in use whose names can’t be converted to UTF-8, the Darwin kernel uses a percent encoding scheme(!)
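
I haven't reproduced Darwin's actual algorithm here, but the general idea of percent-escaping bytes that can't be interpreted as UTF-8 can be sketched as follows. The error handler name and the function name are hypothetical, and the exact escape format is an assumption on my part.

```python
import codecs


def _percent_escape(err):
    # Replace each undecodable byte with %XX and resume after the bad span.
    bad = err.object[err.start:err.end]
    return "".join(f"%{b:02X}" for b in bad), err.end


# Hypothetical handler name, registered for this sketch only.
codecs.register_error("percent_escape_sketch", _percent_escape)


def name_to_text(raw: bytes) -> str:
    """Decode a filename as UTF-8, percent-escaping any invalid bytes."""
    return raw.decode("utf-8", errors="percent_escape_sketch")
```

As written this is one-way: a literal "%" in a valid name is not escaped, so the original bytes can't always be recovered from the result.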

It looks like Apple has changed its mind for APFS and is going with the “bag of bytes” approach that’s typical of other systems; at least, that’s what it appears to have done on iOS.
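
For what it's worth, on a “bag of bytes” filesystem the requirement from the quoted text (hand the same octets back) can still be met by code that wants to work with strings. A minimal sketch using Python's "surrogateescape" error handler, which maps each undecodable byte to a lone surrogate and back again; the helper names are mine:

```python
def name_to_str(raw: bytes) -> str:
    """Decode a filename, smuggling invalid bytes through as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def str_to_name(text: str) -> bytes:
    """Recover the original octets from a string made by name_to_str."""
    return text.encode("utf-8", errors="surrogateescape")
```

The round trip is exact only as long as the string is left alone in between; normalizing or otherwise transforming it can still produce a name the filesystem doesn't have.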

Kind regards,

Alastair.

--
http://alastairs-place.net
Received on Tue May 16 2017 - 10:30:41 CDT
