From: Philippe Verdy (email@example.com)
Date: Thu Dec 09 2004 - 10:04:13 CST
From: "Antoine Leca" <Antoine10646@leca-marti.org>
> Err, not really. MS-DOS *need to know* the encoding to use, a bit like a
> *nix application that displays filenames need to know the encoding to use
> the correct set of glyphs (but constraints are much heavier.) Also
> Windows NT Unicode applications know it, because it can't be changed :-).
> But when it comes to other Windows applications (still the more common)
> happen to operate in 'Ansi' mode, they are subject to the hazard of
> translations. Even if Windows 'knows' the encoding used for the filesystem
> (as when it uses NTFS or Joliet, or VFAT on NT kernels; in the other cases
> it does not even know it, much like with *nix kernels), the only usable
> is the _intersection_ of the set used to write and the set used to read;
> that is, usually, it is restricted to US ASCII, very much like the usable
> set in *nix cases...
True, but this applies to FAT-only filesystems, which happen to store
filenames in an "OEM" charset that is not recorded explicitly on the volume.
This is a known caveat even for Unix, when you look at the tricky details of
the support of Windows file sharing through Samba, when the client requests
a file with a "short" 8.3 name, that a partition used by Windows is supposed
In fact, this nightmare comes from the support in Windows of the
compatibility with legacy DOS applications which don't know the details and
don't use the Win32 APIs with Unicode support. Note that DOS applications
use an "OEM" charset which is part of the user settings, not part of the
system settings (see the effects of the command CHCP in a DOS command
FAT32 and NTFS help reconcile these incompatible charsets because these
filesystems also store an "LFN" (Long File Name) for the same files (in that
case the short name, encoded in some ambiguous OEM charset, is just an
alias, acting exactly like a Unix hard link created in the same directory
and referencing the same file). "LFN" names are UTF-16 encoded and support
mostly the same names as on NTFS volumes.
However, on FAT32 volumes, the short names are mandatory, unlike on NTFS
volumes where they can be created "on the fly" by the filesystem driver,
according to the current user settings for the selected OEM charset, without
storing them explicitly on the volume. Windows contains, in CHKDSK, a way to
verify that short names of FAT32 filesystems are properly encoded with a
coherent OEM charset, using the UTF-16 encoded LFN names as a reference. If
needed, corrections for the OEM charset can be applied...
This nightmare of incompatible OEM charsets does happen on Windows 98/98SE/ME
when the "autoexec.bat" file that defines the current user profile does not
execute the proper "CHCP" command as it should, or when this autoexec.bat
file has been modified or erased: in that case, the default OEM charset
(codepage 437) is used, and short filenames are incorrectly encoded.
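To make the effect concrete, here is a minimal Python sketch of what happens when a short name written under one OEM codepage is read back under codepage 437. The helper name read_short_name and the sample filename are only illustrations, and Python's cp850/cp437 codecs stand in for whatever tables the DOS layer really uses:

```python
# Hypothetical illustration: one short-name byte string, two OEM codepages.
def read_short_name(raw: bytes, oem_codepage: str) -> str:
    """Decode raw short-name bytes the way a host using oem_codepage would."""
    return raw.decode(oem_codepage)

original = "FØRSTE.TXT"                    # sample name, not from a real volume
stored = original.encode("cp850")          # written with CHCP 850 in effect
as_850 = read_short_name(stored, "cp850")  # read back with the same codepage
as_437 = read_short_name(stored, "cp437")  # read after falling back to 437

print(as_850)
print(as_437)
```

The bytes on disk are identical in both reads; only the interpretation changes, which is why nothing on the volume itself can tell you which reading is the right one.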
Another complexity is that Win32 applications that use a fixed (not
user-settable) "ANSI" charset and don't use the Unicode API depend on
the conversion from the ANSI charset to the current OEM charset. But if a
file is handled through directory shares by multiple hosts that have
distinct ANSI charsets (i.e. Windows hosts running different localizations of
Windows, such as a US installation and a French version on the same LAN),
these hosts will create incompatible encodings on the
same shared volume.
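The ANSI-to-OEM step can be sketched roughly as follows in Python. The function ansi_to_oem is a hypothetical stand-in for the conversion Windows performs internally, and the '?' substitution is simply what Python's "replace" error handler does, not necessarily the fallback Windows would choose:

```python
# Hypothetical stand-in for the ANSI -> OEM translation of a filename.
def ansi_to_oem(raw: bytes, ansi_codepage: str, oem_codepage: str) -> bytes:
    """Reinterpret ANSI-encoded filename bytes in the given OEM codepage."""
    text = raw.decode(ansi_codepage)
    # Characters missing from the OEM codepage become '?' in this sketch.
    return text.encode(oem_codepage, errors="replace")

name_ansi = "Ê".encode("cp1252")                 # one accented capital letter
to_850 = ansi_to_oem(name_ansi, "cp1252", "cp850")
to_437 = ansi_to_oem(name_ansi, "cp1252", "cp437")

print(to_850, to_437)
```

The same ANSI name thus yields different short-name bytes depending on which OEM codepage happens to be current, and with codepage 437 the letter cannot be represented at all.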
So the only "stable" subset for short names, one that is not affected by OS
localization or user settings, is the intersection of all possible ANSI and
OEM charsets that can be set in all versions of Windows! Needless to say,
this leaves only the printable ASCII characters for short 8.3 names. Long
filenames are not affected by this problem.
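This intersection can be checked mechanically. The Python sketch below takes a small sample of OEM and ANSI codepages (my own selection, certainly not the full list Windows supports) and keeps only the byte values that decode to the same character in every one of them:

```python
# Sample of OEM and ANSI codepages; an assumption, not an exhaustive list.
CODEPAGES = ["cp437", "cp850", "cp852", "cp866", "cp1250", "cp1251", "cp1252"]

def decode_or_none(byte_value, codepage):
    """Return the character for one byte, or None if the byte is undefined."""
    try:
        return bytes([byte_value]).decode(codepage)
    except UnicodeDecodeError:
        return None

def stable_bytes(codepages):
    """Byte values that mean the same character in every listed codepage."""
    stable = []
    for b in range(256):
        chars = {decode_or_none(b, cp) for cp in codepages}
        if len(chars) == 1 and None not in chars:
            stable.append(b)
    return stable

stable = stable_bytes(CODEPAGES)
printable = [chr(b) for b in stable if 0x20 <= b < 0x7F]
print(len(stable), "".join(printable))
```

With this sample one would expect nothing above 0x7F to survive. The characters that filesystems forbid in names would have to be removed from the printable set as well, which this sketch does not attempt.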
Conclusion: to use international characters outside ASCII in filenames used
by Windows, make sure that the name is not in an 8.3 short format, so
that a long filename, in UTF-16, will be created on FAT32 filesystems or on
SMBFS shares (Samba on Unix/Linux, Windows servers)... Or use NTFS (but then
resolve the interoperability problems with Linux/Unix client hosts that
can't yet reliably access these filesystems, which are not completely
emulated by the Unix filesystems used by Samba, due to limitations of the
LanMan sharing protocol and of Unix filesystems themselves, which rarely
use UTF-8 as their preferred encoding...)
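As a rough aid for applying this advice, a script could flag names that both fit the 8.3 pattern and contain non-ASCII characters. The Python sketch below is only an approximation: fits_8_3 and risks_oem_only_storage are hypothetical helpers, and the real rules Windows uses to decide when to create a long filename (letter case, reserved characters, and so on) are more involved than this length check:

```python
def fits_8_3(name: str) -> bool:
    """Approximate test: 1-8 character base, optional extension of up to 3."""
    base, dot, ext = name.rpartition(".")
    if not dot:                      # no dot at all: the whole name is the base
        base, ext = name, ""
    if "." in base or " " in name:   # several dots or spaces force a long name
        return False
    return 1 <= len(base) <= 8 and len(ext) <= 3

def risks_oem_only_storage(name: str) -> bool:
    """True when a non-ASCII name might be stored only as an OEM short name."""
    return fits_8_3(name) and not name.isascii()

for candidate in ["README.TXT", "été.txt", "résumé-2004.txt"]:
    print(candidate, risks_oem_only_storage(candidate))
```

Names flagged by such a check could simply be lengthened past eight characters so that they no longer fit the short format.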