RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Dec 11 2004 - 05:29:16 CST


    Arcane Jill responded:
    > >> Windows filesystems do know what encoding they use.
    > >Err, not really. MS-DOS *need to know* the encoding to use, a bit like a
    > >*nix application that displays filenames need to know the encoding to use
    > >the correct set of glyphs (but constraints are much more heavy.)
    >
    > Sure, but MS-DOS is not Windows. MS-DOS uses "8.3" filenames. But it's not
    > like MS-DOS is still terrifically popular these days.
    I don't know what Antoine meant by MS-DOS, but since he mentioned it in the
    Windows context, I thought it was about Windows console applications
    (the console is still often referred to as a DOS box, I think).

    > The fact that applications can still open files using the legacy fopen()
    > call (which requires char*, hence 8-bit-wide, strings) is kind of
    > irrelevant. If the user creates a file using fopen() via a code page
    > translation, AND GETS IT WRONG, then the file will be created with Unicode
    > characters other than those she [intended] - but those characters will
    > still be Unicode and unambiguous, no?
    Funny thing. Nobody cares much if a Latin 2 string is misinterpreted and a
    Latin 1 conversion is used instead, as long as they can create the file. But
    what if a Latin 2 string is misinterpreted and a UTF-8 conversion is used? You
    won't just get a filename with characters other than those you expected.
    Either the file won't open at all (depending on where and how the validation
    is done), or you risk that two files you create one after another will
    overwrite each other. Note that I am talking about files you create from
    within this scenario, not files that existed on the disk before.
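
    The difference between the two misinterpretations can be sketched in a few
    lines of Python. This is only an illustration under assumptions: the helper
    name stored_name is hypothetical, the two sample filenames are made up, and
    replacing each invalid byte with U+FFFD is just one possible policy for a
    lenient conversion, not a description of what any particular filesystem API
    does.

```python
def stored_name(raw: bytes, assumed_encoding: str, errors: str = "strict") -> str:
    """Hypothetical helper: the Unicode name that results when the 8-bit
    name `raw` is converted using `assumed_encoding`."""
    return raw.decode(assumed_encoding, errors=errors)


def strict_utf8_fails(raw: bytes) -> bool:
    """True if a validating UTF-8 conversion rejects the name outright."""
    try:
        stored_name(raw, "utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Two different Latin 2 (ISO 8859-2) filenames, as the user typed them.
name_a = "č.txt".encode("iso8859_2")  # b'\xe8.txt'
name_b = "š.txt".encode("iso8859_2")  # b'\xb9.txt'

# Misinterpreted as Latin 1: wrong characters, but the names stay distinct.
as_latin1 = [stored_name(n, "latin-1") for n in (name_a, name_b)]

# Misinterpreted as UTF-8 with validation: neither file can be created.
rejected = [strict_utf8_fails(n) for n in (name_a, name_b)]

# Misinterpreted as UTF-8 with lenient replacement (assumed policy):
# both names collapse to the same string, so the second file
# would overwrite the first.
as_utf8 = [stored_name(n, "utf-8", errors="replace") for n in (name_a, name_b)]

print(as_latin1)                 # two distinct names
print(rejected)                  # both rejected by strict validation
print(as_utf8[0] == as_utf8[1])  # the lenient names collide
```

    Latin 1 assigns a character to every byte value, so a wrong Latin 1
    conversion is always reversible. A UTF-8 conversion has no such guarantee
    for arbitrary 8-bit input.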

    Second thing: OK, you say fopen is a legacy call. True, you can use _wfopen.
    So, you can have a console application in Unicode and all problems are
    solved? No. Standard input and standard output are 8-bit, and a code page is
    used. And it has to remain so, if you want the old and the new applications
    to be able to communicate. So, the logical conclusion is that UTF-8 needs to
    be used instead of a code page. Unfortunately, Windows has problems with
    that. Try MODE CON: CP SELECT=65001. Much of it works, but batch files don't
    run.

    Now suppose Windows does work correctly with the code page set to UTF-8. You
    create an application that reads stdin, counts the words longer than 10
    codepoints and passes the input unmodified to stdout. What happens:
    * set CP to Latin 1, process Latin 1: correct result
    * set CP to Latin 1, process UTF-8: wrong result
    * set CP to UTF-8, process UTF-8: correct result
    * set CP to UTF-8, process Latin 1: wrong result, corrupted output
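
    The four cases can be tried out with a small Python sketch. It rests on
    assumptions that are not part of the scenario above: the function name
    count_and_pass and the sample text are made up, the filter is assumed to
    decode its input with the console code page and re-encode it on output, and
    the handling of invalid UTF-8 (replace with U+FFFD, or drop the bytes) is a
    parameter because the outcome depends on it.

```python
def count_and_pass(data: bytes, codepage: str, errors: str = "replace"):
    """Decode `data` with the console code page, count words longer than
    10 code points, and re-encode the text for stdout."""
    text = data.decode(codepage, errors=errors)
    long_words = sum(1 for word in text.split() if len(word) > 10)
    return long_words, text.encode(codepage, errors=errors)


# Sample text: one word of 10 code points, one of 11. Correct answer: 1.
SAMPLE = "Änderungen Übersetzung"
latin1_data = SAMPLE.encode("latin-1")
utf8_data = SAMPLE.encode("utf-8")

cases = [
    ("CP Latin 1, data Latin 1", "latin-1", latin1_data),
    ("CP Latin 1, data UTF-8", "latin-1", utf8_data),
    ("CP UTF-8, data UTF-8", "utf-8", utf8_data),
    ("CP UTF-8, data Latin 1", "utf-8", latin1_data),
]

results = {}
for label, codepage, data in cases:
    count, out = count_and_pass(data, codepage)
    results[label] = (count, out == data)
    print(f"{label}: count={count}, output unmodified={out == data}")

# Same mismatch, but with a decoder that silently drops invalid bytes.
dropped_count, dropped_out = count_and_pass(latin1_data, "utf-8", errors="ignore")
print(f"CP UTF-8, data Latin 1 (drop invalid): count={dropped_count}")
```

    With the Latin 1 code page and UTF-8 data, every multi-byte character is
    counted as several code points, but the bytes pass through untouched. With
    the UTF-8 code page and Latin 1 data, the output is altered in both
    variants. In this sketch the count happens to survive U+FFFD replacement
    for this particular sample and goes wrong when invalid bytes are dropped,
    so it cannot be relied on either way.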

    Now, I wonder why Windows is not supporting UTF-8 as much as one would
    want...

    Lars


