Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Thu Dec 09 2004 - 07:48:48 CST

Next message: Azzedine Ait Khelifa: "Re: IUC27 Unicode, Cultural Diversity, and Multilingual Computing / Africa is forgotten once again."

Previous message: Antoine Leca: "Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)"
Maybe in reply to: Doug Ewell: "Invalid UTF-8 sequences (was: Re: Nicest UTF)"
Next in thread: Doug Ewell: "US-ASCII (was: Re: Invalid UTF-8 sequences)"
Reply: Doug Ewell: "US-ASCII (was: Re: Invalid UTF-8 sequences)"

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
Behalf Of Antoine Leca
Sent: 09 December 2004 11:29
To: Unicode Mailing List
Subject: Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

>> Windows filesystems do know what encoding they use.
>Err, not really. MS-DOS *needs to know* the encoding to use, a bit like a
>*nix application that displays filenames needs to know the encoding to use
>the correct set of glyphs (but the constraints are much heavier.)

Sure, but MS-DOS is not Windows. MS-DOS uses "8.3" filenames. But it's not
like MS-DOS is still terrifically popular these days.

>But when it comes to other Windows applications (still the more common) that
>happen to operate in 'Ansi' mode, they are subject to the hazard of codepage
>translations.

Sure, but this has got nothing to do with the filesystem. The Windows
filesystem(s) store filenames in those disk sectors which are reserved for
file headers, and in these locations they are stored using sixteen-bit-wide
code units. (I assume this can only be UTF-16?) Thus, "Windows filesystems
do know what encoding they use" seems to me to be a correct statement.
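
To make the "sixteen-bit-wide code units" point concrete, here is a small
Python sketch. It assumes, as the paragraph above does, that the stored form
is UTF-16 (little-endian here). The helper name utf16_code_units and the
sample filenames are invented for illustration, and nothing here reads a real
filesystem.

```python
def utf16_code_units(name: str) -> list:
    """Return the 16-bit code units of `name` when encoded as UTF-16LE."""
    raw = name.encode("utf-16-le")
    # Every code unit occupies exactly two bytes, low byte first.
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


# A name made only of BMP characters: one code unit per character.
bmp_name = "r\u00e9sum\u00e9.txt"

# A name containing U+1D11E (MUSICAL SYMBOL G CLEF), which lies outside
# the BMP and therefore needs a surrogate pair (two code units).
astral_name = "\U0001D11E.txt"

bmp_units = utf16_code_units(bmp_name)
astral_units = utf16_code_units(astral_name)

print(len(bmp_name), len(bmp_units))
print([hex(u) for u in astral_units[:2]])
```

A character outside the BMP takes two code units, so "sixteen-bit wide" does
not mean one unit per character.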

The fact that applications can still open files using the legacy fopen()
call (which requires char*, hence 8-bit-wide, strings) is kind of
irrelevant. If the user creates a file using fopen() via a code page
translation, AND GETS IT WRONG, then the file will be created with Unicode
characters other than those she intended - but those characters will still
be Unicode and unambiguous, no?
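
A rough Python illustration of that failure mode. The two code pages (cp1252
and cp437) and the filename are assumptions picked for the example. The
conversion Windows actually performs is not shown, only the effect of reading
the same 8-bit bytes under the wrong code page.

```python
intended = "caf\u00e9.txt"          # the name the user meant to create

# The application supplies 8-bit bytes in one code page...
raw_bytes = intended.encode("cp1252")

# ...but the bytes are interpreted under a different code page.
stored = raw_bytes.decode("cp437")

# The resulting name is not what the user wanted, yet every character in it
# is an ordinary Unicode code point, and it survives a UTF-16 round trip.
code_points = ["U+%04X" % ord(ch) for ch in stored]
round_trip = stored.encode("utf-16-le").decode("utf-16-le")

print(stored)
print(code_points)
```

The mistranslation yields the wrong name, but not an ambiguous one: byte 0xE9
has become U+0398, and that is exactly what would be stored.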

>that is, usually, it is restricted to US ASCII, very much like the usable
>set in *nix cases...

[OFF TOPIC] Why do so many people call it "US ASCII" anyway? Since "ASCII"
comprises that subset of Unicode from U+0000 to U+007F, it is not clear to
me in what way "US-ASCII" is different from ASCII. It's bad enough for us
non-Americans that the A in ASCII already stands for "American", but to
stick "US" on the front as well is just .... Anyway, back to the discussion
on US-Unicode...


This archive was generated by hypermail 2.1.5 : Thu Dec 09 2004 - 07:58:43 CST