Re: _Unicode_code_page_and_?.net

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 31 Jul 2013 14:38:13 +0200

My opinion is that the term "code page" in Windows does not really
designate an encoding, or any code page switching at all (dynamic switches
with contextual states for the current page). It just indicates an encoding
scheme that will be processed by the device specifying it.

Windows complicates things a bit because it works with several "code pages"
simultaneously, depending on the API used ("ANSI_CP" for the GUI Windows
API, "OEM_CP" for the Console Windows API, another code page depending on
the storage device, another one used at boot time for loading device
drivers and working with the debugging console, and sometimes yet another
one for specific areas like the registry, for compatibility reasons).
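
To illustrate why the two code pages matter, here is a small Python sketch.
It assumes that cp1252 stands in for the "ANSI" code page and cp437 for the
"OEM" code page (typical for a US English installation; other locales use
other pairs), and the names ANSI_CP, OEM_CP and encode_both are mine, not
anything from the Windows API:

```python
# Assumption: cp1252 plays the role of the "ANSI" code page and cp437 the
# role of the "OEM" code page. Real systems may be configured differently.
ANSI_CP = "cp1252"
OEM_CP = "cp437"


def encode_both(text):
    """Return the bytes for `text` in the ANSI and the OEM code page."""
    return text.encode(ANSI_CP), text.encode(OEM_CP)


ansi_bytes, oem_bytes = encode_both("é")

# The same character has a different byte value in each code page, so bytes
# written through one API and read through the other are misinterpreted.
misread = oem_bytes.decode(ANSI_CP)

print(ansi_bytes, oem_bytes, repr(misread))
```

The point is only that the byte for the same character differs, so text
passed between a console program and a GUI program without conversion comes
out as a different character.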

Each API provides its own supported encoding(s). Not all "code pages"
supported by Windows are supported by all APIs or devices, and Windows will
convert them dynamically at run time using either Unicode or the
OS-specific OEM code page for old 16-bit systems. If the device or API
supports the full UCS (using one of the UTFs), the conversion will be
relatively reliable; if not, the conversion to any other encoding scheme
will frequently be lossy (and will use some internal remappings, e.g. for
filenames on old FAT filesystems).

Sometimes Windows cannot reliably determine the encoding schemes supported
by the API or device, because it does not report them reliably (e.g. on old
FAT12 floppies, on FTP filesystems, in legacy HTTP/1.0, or when
mounting a Unix or legacy MacOS filesystem on Windows with some drivers,
because these filesystems did not report the encoding scheme reliably
either).

The command-line "CHCP" tool (or "MODE CON CP ...") just allows
preparing/declaring a device with the code page that it should support, even
if the device does not specify it reliably. This is not specific to Windows,
because other OSes may also need to make some default assumptions (using the
local OS settings as the default when mounting an existing
filesystem which may have been prepared somewhere else).

There are similar issues when mounting database files or connecting to a
remote RDBMS: you sometimes need to specify the encoding scheme in the
mounting options. And in all cases, there are different behaviors if the
mounted filesystem, device or database does not use a standard UTF: the
API may report exceptions, or may choose a lossy conversion with
"approximate" character remappings, or with a default replacement character.
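
These three behaviors can be sketched in Python. This is only an
illustration: the function names are hypothetical, ASCII is used as the
stand-in legacy encoding, and the "approximate" variant uses NFKD
decomposition as a substitute for the best-fit tables that a real system
would apply (it is not the Windows mapping):

```python
import unicodedata


def convert_strict(text, encoding):
    """Report an exception when a character cannot be represented."""
    return text.encode(encoding)  # raises UnicodeEncodeError on failure


def convert_replace(text, encoding):
    """Lossy conversion using a default replacement character ('?')."""
    return text.encode(encoding, errors="replace")


def convert_approximate(text, encoding):
    """Lossy conversion with approximate remappings.

    Assumption: NFKD decomposition followed by dropping whatever still
    does not fit stands in for a real best-fit mapping table.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode(encoding, errors="ignore")


sample = "café ①"

try:
    convert_strict(sample, "ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

replaced = convert_replace(sample, "ascii")
approximated = convert_approximate(sample, "ascii")

print(strict_failed, replaced, approximated)
```

Only the first variant tells the caller that something went wrong; the two
others silently return altered data.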

If the API does not report any exception, it may be necessary to reread the
content of the stored data to see how it was converted, even if the API
used by the application uses the UCS (generally UTF-16 for the Win32
"Unicode" API).
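
A minimal Python sketch of such a reread check follows. The helper name
write_and_reread is hypothetical, a temporary file stands in for the
storage device, and the "replace" error mode simulates an API that
converts silently instead of raising an exception:

```python
import os
import tempfile


def write_and_reread(text, encoding, errors="replace"):
    """Store `text` in `encoding`, then read it back from storage."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # errors="replace" simulates a silent lossy conversion on write.
        with open(path, "w", encoding=encoding, errors=errors, newline="") as f:
            f.write(text)
        with open(path, "r", encoding=encoding, newline="") as f:
            return f.read()
    finally:
        os.remove(path)


original = "Ωmega café"

# Legacy code page: the write succeeds, but the stored data differs.
stored = write_and_reread(original, "cp1252")
was_altered = stored != original

# A UTF preserves the text, so the comparison finds no difference.
stored_utf16 = write_and_reread(original, "utf-16")

print(repr(stored), was_altered, stored_utf16 == original)
```

Comparing what was stored with what was sent is the only way to detect the
loss here, since the write itself reports nothing.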

In some cases Windows will store two simultaneous representations: a UTF
and a lossy legacy encoding (e.g. on FAT32, where LFN filenames are UTF-16
encoded, and are also remapped for compatibility to another encoding scheme
or "code page", with some other conversion rules so that files get
distinct filenames in the legacy encoding, using dynamically generated
suffixes that preserve only a part of the name, with a truncated file
extension). If the filesystem is NTFS, you may tell Windows not to
generate and store these additional filenames (which act like supplementary
aliases, and may or may not be made visible in the Unicode API, or may be
visible only through another compatibility interface which may or may not
also see the UCS-encoded filenames).
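
The general idea of these generated aliases can be sketched as follows.
This is a deliberately simplified Python illustration, not the actual
Windows algorithm: the function name, the use of cp437 as the legacy code
page, the "_" substitute and the fact that a "~N" suffix is always added
are all my assumptions, and real systems handle collisions and many more
special cases:

```python
def short_name_sketch(long_name, index=1, codepage="cp437"):
    """Build a simplified 8.3-style alias for a long filename."""
    base, dot, ext = long_name.rpartition(".")
    if not dot:  # no extension at all
        base, ext = long_name, ""

    def clean(part):
        out = []
        for ch in part.upper():
            if ch in " .":
                continue  # spaces and extra dots are dropped
            try:
                ch.encode(codepage)
            except UnicodeEncodeError:
                ch = "_"  # not representable in the legacy code page
            out.append(ch)
        return "".join(out)

    suffix = "~%d" % index
    # Keep only the start of the name so that name + suffix fits in 8.
    short_base = clean(base)[: 8 - len(suffix)] + suffix
    short_ext = clean(ext)[:3]  # truncated file extension
    return short_base + ("." + short_ext if short_ext else "")


print(short_name_sketch("Long Document Name.docx"))
print(short_name_sketch("Long Document Notes.docx", index=2))
print(short_name_sketch("日本語のファイル.text"))
```

Two long names that share the same first characters only remain distinct
through the numeric suffix, and a name made of characters outside the
legacy code page keeps nothing recognizable at all.
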
Received on Wed Jul 31 2013 - 07:44:11 CDT