Re: Unicode in VFAT file system

From: addison@inter-locale.com
Date: Fri Jul 21 2000 - 07:37:17 EDT


Hi Ken,

UCS-2 is pretty close to the same thing as UTF-16: both use 16-bit code
units, and UTF-16 adds surrogate pairs for code points beyond U+FFFF. The
differences mostly do not apply here.

UCS-2 can be big-endian or little-endian. The rule is that BE is the
default. However, on Intel platforms, you shouldn't be surprised to see LE
everywhere: that's the architecture. Microsoft is saving two bytes for
every filename by not storing a BOM.
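
To make the byte order concrete, here is a small Python sketch. It is only an illustration of the encodings themselves, not of the VFAT on-disk layout, and the file name "A.TXT" is a made-up example. It shows the same name serialized little-endian and big-endian, and the two bytes a BOM would add.

```python
import codecs

name = "A.TXT"  # hypothetical file name, for illustration only

le = name.encode("utf-16-le")  # little-endian code units, no BOM
be = name.encode("utf-16-be")  # big-endian code units, no BOM

# Each character becomes two bytes; only the order within each pair differs.
print(le.hex())
print(be.hex())

# Prefixing a BOM would cost two extra bytes per name.
with_bom = codecs.BOM_UTF16_LE + le
print(len(with_bom) - len(le))
```

In a disk editor the little-endian form is what looks "byte-swapped": the low byte of each character comes first.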

You should note that Microsoft *means* UCS-2LE (and UTF-16LE in more
modern systems) when they say "Unicode" (at least on Intel platforms).

So:

1. Yes, it is perfectly valid.
2. There are no characters in the surrogate space just yet, so a black
square should be no surprise. Two black squares mean that each half of the
surrogate pair is being treated as a separate UCS-2 character.
3. Filenames are, by definition in Windows-land, UPPERCASE on Western
European systems. Other scripts either don't have the concept of case or
weren't mucked with. That includes compatibility characters stored outside
the U+0000 to U+00FF range.
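
On point 2, a rough Python sketch of why a surrogate pair can show up as two boxes. This is an assumption-laden illustration, not a description of what Windows actually does internally: it encodes one code point beyond U+FFFF (U+10000, picked arbitrarily) as UTF-16LE, then reads the bytes back as independent 16-bit units the way a UCS-2 reader would.

```python
import struct

ch = "\U00010000"  # a code point beyond U+FFFF, chosen only for illustration
raw = ch.encode("utf-16-le")

# Read the bytes back as plain little-endian 16-bit units, as a reader
# that knows only UCS-2 would.
units = struct.unpack("<%dH" % (len(raw) // 2), raw)
print([hex(u) for u in units])


def is_surrogate(unit):
    """True if the 16-bit unit lies in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF


# Both units are surrogates: neither is a displayable character on its own.
print(all(is_surrogate(u) for u in units))

# A UTF-16 reader combines the pair back into a single code point.
print(len(raw.decode("utf-16-le")))
```

A UCS-2 reader sees two units with no glyph of their own, hence two boxes; a UTF-16 reader sees one character (and would show one box if the font has no glyph for it).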

Regards,

Addison

===========================================================
Addison P. Phillips Principal Consultant
Inter-Locale LLC http://www.inter-locale.com
Los Gatos, CA, USA mailto:addison@inter-locale.com

+1 408.210.3569 (mobile) +1 408.904.4762 (fax)
===========================================================
Globalization Engineering & Consulting Services

On Thu, 20 Jul 2000, Ken Krugler wrote:

> Hi Unicoders,
>
> Recently I've had the dubious pleasure of delving into the details of
> the VFAT file system. For long file names, I thought it used UCS-2,
> but in looking at the data with a disk editor, it appears to be
> byte-swapping (little endian). I thought that UCS-2 was by definition
> big endian, thus I've got the following questions:
>
> 1. Could it be using UTF-16LE? I tried creating an entry with a
> surrogate pair, but the name was displayed with two black boxes on a
> Windows 2000-based computer, so I assumed that surrogates were not
> supported.
>
> 2. Is little-endian UCS-2 a valid encoding that I just don't know about?
>
> 3. And finally, why are file names case-insensitive for characters in
> the U+0000 to U+00FF range, but not for any other characters? OK,
> maybe I can guess at the answer to that one...
>
> Thanks,
>
> -- Ken
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-470-9200
>


