Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Thu Dec 09 2004 - 07:48:48 CST


    -----Original Message-----
    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
    Behalf Of Antoine Leca
    Sent: 09 December 2004 11:29
    To: Unicode Mailing List
    Subject: Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

    >> Windows filesystems do know what encoding they use.
    >Err, not really. MS-DOS *needs to know* the encoding to use, a bit like a
    >*nix application that displays filenames needs to know the encoding to use
    >the correct set of glyphs (but the constraints are much heavier).

    Sure, but MS-DOS is not Windows. MS-DOS uses "8.3" filenames. But it's not
    like MS-DOS is still terrifically popular these days.

    >But when it comes to other Windows applications (still the more common)
    >that happen to operate in 'Ansi' mode, they are subject to the hazard of
    >codepage translations.

    Sure, but this has got nothing to do with the filesystem. The Windows
    filesystem(s) store filenames in those disk sectors which are reserved for
    file headers, and in these locations they are stored using sixteen-bit-wide
    code units. (I assume this can only be UTF-16?) Thus, "Windows filesystems
    do know what encoding they use" seems to me to be a correct statement.
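
    To illustrate what "sixteen-bit-wide code units" means here, a minimal
    Python sketch, assuming (as guessed above) that the encoding is UTF-16.
    The helper name utf16_code_units and the sample filename are made up for
    illustration; this says nothing about how any real filesystem lays out
    its sectors.

```python
def utf16_code_units(name):
    """Return the list of 16-bit code units that encode `name` in UTF-16."""
    raw = name.encode("utf-16-le")
    # Every code unit occupies exactly two bytes, so step through in pairs.
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


# Hypothetical filename containing a non-ASCII character (e with acute accent).
sample = "r\u00e9sum\u00e9.txt"
units = utf16_code_units(sample)

# Reassembling the same code units yields exactly the original name,
# which is the sense in which the stored name is unambiguous.
roundtrip = b"".join(u.to_bytes(2, "little") for u in units).decode("utf-16-le")
```

    A character outside the Basic Multilingual Plane takes two such code
    units (a surrogate pair), so the number of code units is not always the
    number of characters.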

    The fact that applications can still open files using the legacy fopen()
    call (which requires char*, hence 8-bit-wide, strings) is kind of
    irrelevant. If the user creates a file using fopen() via a code page
    translation, AND GETS IT WRONG, then the file will be created with Unicode
    characters other than those she intended - but those characters will still
    be Unicode
    and unambiguous, no?
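
    A rough Python sketch of that scenario, under stated assumptions: the
    application hands over bytes in Windows-1252, and the translation layer
    wrongly assumes code page 437. Both code pages, the filename, and the
    helper name misread_filename are hypothetical choices for illustration,
    not a description of what any particular Windows API does.

```python
def misread_filename(intended, app_codepage, assumed_codepage):
    """Encode with the code page the application really used, then decode
    with the (different) code page the translation layer assumed."""
    raw = intended.encode(app_codepage)   # the 8-bit string given to fopen()
    return raw.decode(assumed_codepage)   # what actually ends up stored


intended = "caf\u00e9.txt"                # the name the user meant
stored = misread_filename(intended, "cp1252", "cp437")

# The stored name is wrong, but it is still a well-defined Unicode string:
# it survives a UTF-16 round trip unchanged.
still_unicode = stored.encode("utf-16-le").decode("utf-16-le") == stored
```

    Here the byte 0xE9 means U+00E9 in Windows-1252 but U+0398 (GREEK
    CAPITAL LETTER THETA) in code page 437, so the stored name differs from
    the intended one while remaining perfectly valid Unicode.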

    >that is, usually, it is restricted to US ASCII, very much like the usable
    >set in *nix cases...

    [OFF TOPIC] Why do so many people call it "US ASCII" anyway? Since "ASCII"
    comprises that subset of Unicode from U+0000 to U+007F, it is not clear to
    me in what way "US-ASCII" is different from ASCII. It's bad enough for us
    non-Americans that the A in ASCII already stands for "American", but to
    stick "US" on the front as well is just .... Anyway, back to the discussion
    on US-Unicode...
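
    For what it's worth, a small Python check of the two claims in that
    aside. It assumes only Python's standard codecs module, which happens
    to treat "us-ascii" as an alias of "ascii"; other libraries may name
    things differently.

```python
import codecs

# Both labels resolve to one and the same codec in Python's registry.
same_codec = codecs.lookup("us-ascii").name == codecs.lookup("ascii").name

# Every byte value 0x00..0x7F decodes to the code point with the same number.
decoded = bytes(range(0x80)).decode("ascii")
code_points = [ord(ch) for ch in decoded]
matches_unicode = code_points == list(range(0x80))
```

    Any byte from 0x80 upwards is simply not ASCII, and decoding it with
    this codec raises UnicodeDecodeError.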


