Re: Unicode in VFAT file system

From: Ken Krugler
Date: Thu Jul 20 2000 - 15:28:13 EDT

Hi Addison,

>UCS-2 is pretty close to the same thing as UTF-16. The differences do not
>apply here.
>UCS-2 can be big-endian or little-endian. The rule is that BE is the
>default. However, on Intel platforms, you shouldn't be surprised to see LE
>everywhere: that's the architecture. Microsoft is saving two bytes for
>every filename by not storing a BOM.

Thanks for the fast response. I was basing my understanding of UCS-2
always being big-endian on Markus Kuhn's prior email, which said:

At 2:58am -0800 00-02-18, Markus Kuhn wrote:
>Date: Fri, 18 Feb 2000 02:58:51 -0800 (PST)
>From: Markus Kuhn <>
>Subject: Re: UCS-4, UCS-2, UTF-16, UTF-8
>To: Unicode List <>
>X-UML-Sequence: 12380 (2000-02-18 10:58:53 GMT)
>Yung-Fong Tang wrote on 2000-02-17 21:18 UTC:
> > UCS-4 does not specify byte order, but UTF-32BE and
> > UTF-32LE does.
>No. UCS-2 and UCS-4 have always been bigendian. Read ISO 10646-1:1993,
>section "6.3 Octet order" (page 7):
> When serialized as octets, a more significant octet shall
> precede less significant octets.
>ISO and ITU have fortunately always frowned upon Intel's horrible 1970s
>decision of staying compatible with some obscure long-forgotten 1960s
>mainframe for which they had bought some software when they made the
>8080 a littleendian processor (Intel's microcontrollers by the way are
>all bigendian, as is pretty much anything else that was not designed to
>be Intel compatible).

So now I'm a bit confused, since I've never heard of UCS-2LE/UCS-2BE.
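
To make the byte-order question concrete, here is a minimal sketch of what the two serializations look like for a short name. It assumes Python's built-in "utf-16-le" and "utf-16-be" codecs as stand-ins for little- and big-endian UCS-2 (they agree for characters without surrogates); the sample name and the helper swap16 are made up for illustration and are not taken from any VFAT implementation.

```python
def swap16(data):
    """Swap the two bytes of every 16-bit code unit."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    out = bytearray(data)
    # Right-hand side is read from the untouched input, so this is a clean swap.
    out[0::2], out[1::2] = data[1::2], data[0::2]
    return bytes(out)

name = "Ab\u00e9"                        # hypothetical file name: A, b, e-acute
on_disk = name.encode("utf-16-le")       # what the disk editor appears to show
big_endian = name.encode("utf-16-be")    # what ISO 10646 octet order would give

print(on_disk.hex())       # 41006200e900
print(big_endian.hex())    # 00410062 00e9 without the space: 0041006200e9
print(swap16(on_disk) == big_endian)     # True
```

Without a BOM there is nothing in the bytes themselves that says which order was used, so a reader has to know the convention in advance.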

>You should note that Microsoft *means* UCS-2LE (and UTF-16LE in more
>modern systems) when they say "Unicode" (at least on Intel platforms).
>1. Yes, it is perfectly valid.
>2. There are no characters in the surrogate space just yet, so a black
>square should be no surprise. Two black squares means that it's being
>treated as UCS-2.

Does anybody know if Microsoft has publicly stated if/when they'll
support surrogates in VFAT file names?
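
For reference, the "two black squares" behavior follows from how a surrogate pair is built: one code point above U+FFFF becomes two 16-bit code units, and software that treats the data as UCS-2 sees two separate (unassigned) characters. A small sketch of the standard UTF-16 arithmetic, with a hypothetical helper name to_surrogates:

```python
def to_surrogates(code_point):
    """Split a supplementary code point into (high, low) surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogates(0x10000)
print(hex(high), hex(low))                 # 0xd800 0xdc00

# The same character occupies two 16-bit units when encoded as UTF-16.
units = len("\U00010000".encode("utf-16-le")) // 2
print(units)                               # 2
```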

>3. Filenames are, by definition in Windows-land, UPPERCASE in Western
>European systems.

My understanding is that DOS always upper-cased file names, though
probably only for the Western European code pages. With VFAT, the
file names are stored as-is but compared case-insensitively when
checking for uniqueness (and only in the Basic Latin and Latin-1
Supplement range).
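
A rough sketch of the comparison as I understand it, not the actual VFAT algorithm: fold case only for characters at or below U+00FF, and leave everything else untouched. The function names fold_latin1 and same_name are hypothetical, and the rule for characters whose uppercase form falls outside the range (for example U+00FF or U+00DF) is my assumption.

```python
def fold_latin1(ch):
    """Upper-case ch only if both it and its uppercase form are <= U+00FF."""
    if ord(ch) > 0xFF:
        return ch                      # outside the range: compared as-is
    upper = ch.upper()
    if len(upper) == 1 and ord(upper) <= 0xFF:
        return upper
    return ch                          # assumption: leave awkward cases alone

def same_name(a, b):
    """True if two names would collide under the Latin-1-only folding."""
    return [fold_latin1(c) for c in a] == [fold_latin1(c) for c in b]

print(same_name("Readme.TXT", "README.txt"))   # True
print(same_name("caf\u00e9", "CAF\u00c9"))     # True, e-acute is in Latin-1
print(same_name("\uff41", "\uff21"))           # False, full-width a vs A
```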

>Other scripts either don't have the concept of case or
>weren't mucked with. This includes compatibility characters stored outside
>the U+0000 to U+00FF range.

OK - this matches the behavior I was seeing with Japanese Windows
systems, where full-width Romaji isn't case-folded before checking
file names.
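
Note that the full-width letters do have case mappings in the Unicode character data, so leaving them unfolded looks like a property of the file system's comparison rather than of the characters. A quick check, assuming Python's str.upper and unicodedata module reflect the Unicode data:

```python
import unicodedata

fullwidth_small_a = "\uff41"      # FULLWIDTH LATIN SMALL LETTER A
fullwidth_capital_a = "\uff21"    # FULLWIDTH LATIN CAPITAL LETTER A

# Unicode defines an uppercase mapping between the two full-width forms.
print(fullwidth_small_a.upper() == fullwidth_capital_a)        # True

# Compatibility normalization maps the full-width form to ASCII "a".
print(unicodedata.normalize("NFKC", fullwidth_small_a) == "a") # True
```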


-- Ken

>Addison P. Phillips Principal Consultant
>Inter-Locale LLC
>Los Gatos, CA, USA
>+1 408.210.3569 (mobile) +1 408.904.4762 (fax)
>Globalization Engineering & Consulting Services
>On Thu, 20 Jul 2000, Ken Krugler wrote:
> > Hi Unicoders,
> >
> > Recently I've had the dubious pleasure of delving into the details of
> > the VFAT file system. For long file names, I thought it used UCS-2,
> > but in looking at the data with a disk editor, it appears to be
> > byte-swapping (little endian). I thought that UCS-2 was by definition
> > big endian, thus I've got the following questions:
> >
> > 1. Could it be using UTF-16LE? I tried creating an entry with a
> > surrogate pair, but the name was displayed with two black boxes on a
> > Windows 2000-based computer, so I assumed that surrogates were not
> > supported.
> >
> > 2. Is little-endian UCS-2 a valid encoding that I just don't know about?
> >
> > 3. And finally, why are file names case-insensitive for characters in
> > the U+0000 to U+00FF range, but not for any other characters? OK,
> > maybe I can guess at the answer to that one...
> >
> > Thanks,
> >
> > -- Ken
> > Ken Krugler
> > TransPac Software, Inc.
> > +1 530-470-9200
> >

Ken Krugler
TransPac Software, Inc.
+1 530-470-9200

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:06 EDT