RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Dec 11 2004 - 05:29:16 CST

Next message: Michael Everson: "Re: US-ASCII (was: Re: Invalid UTF-8 sequences)"

Previous message: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"
Maybe in reply to: Doug Ewell: "Invalid UTF-8 sequences (was: Re: Nicest UTF)"
Next in thread: Lars Kristan: "RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)"

Arcane Jill responded:
> >> Windows filesystems do know what encoding they use.
> >Err, not really. MS-DOS *need to know* the encoding to use,
> >a bit like a
> >*nix application that displays filenames need to know the
> >encoding to use
> >the correct set of glyphs (but constraints are much more heavy.)
>
> Sure, but MS-DOS is not Windows. MS-DOS uses "8.3" filenames.
> But it's not
> like MS-DOS is still terrifically popular these days.
I don't know what Antoine meant by MS-DOS, but since he mentioned it in the
Windows context, I thought it was about Windows console applications
(console is still often referred to as DOS box, I think).

> The fact that applications can still open files using the
> legacy fopen()
> call (which requires char*, hence 8-bit-wide, strings) is kind of
> irrelevant. If the user creates a file using fopen() via a code page
> translation, AND GETS IT WRONG, then the file will be created
> with Unicode
> characters other than those she - but those characters will
> still be Unicode
> and unambiguous, no?
Funny thing: nobody cares much if a Latin 2 string is misinterpreted and a
Latin 1 conversion is used instead, as long as they can create the file. But
if a Latin 2 string is misinterpreted and a UTF-8 conversion is used? You
won't just get a filename with characters other than those you expected.
Either the file won't open at all (depending on where and how the validation
is done), or you risk that two files you create one after another will
overwrite each other. Note that I am talking about files you create from
within this scenario, not files that already existed on the disk.
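
A minimal sketch of the two failure modes described above, in Python. The
helper names are hypothetical, and the "substitute U+FFFD for invalid bytes"
policy is an assumption standing in for whatever a real conversion layer
does; it is not a description of how Windows behaves.

```python
# Hypothetical helper: a lenient 8-bit-to-Unicode filename conversion that
# substitutes U+FFFD for bytes it cannot decode (an assumed policy).
def to_unicode_name(raw: bytes, code_page: str) -> str:
    return raw.decode(code_page, errors="replace")


# Hypothetical helper: a validating conversion that rejects bad input.
def opens_under_strict_utf8(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# Two different Latin 2 filenames (c-caron and s-caron, each plus ".txt").
name_a = b"\xe8.txt"
name_b = b"\xb9.txt"

# Misread as Latin 1: wrong characters, but still two distinct names.
latin1_names = {to_unicode_name(n, "latin-1") for n in (name_a, name_b)}

# Misread as UTF-8 with replacement: both collapse to the same name,
# so the second file created would overwrite the first.
utf8_names = {to_unicode_name(n, "utf-8") for n in (name_a, name_b)}

print(len(latin1_names), len(utf8_names))
print(opens_under_strict_utf8(name_a), opens_under_strict_utf8(name_b))
```

With strict validation neither name converts at all, which is the "file won't
open" case; with lenient replacement the two names become identical, which is
the "overwrite each other" case.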

Second thing: OK, you say fopen is a legacy call. True, you can use _wfopen.
So, you can have a console application in Unicode and all problems are
solved? No. Standard input and standard output are 8-bit, and a code page is
used. And it has to remain so, if you want the old and the new applications
to be able to communicate. So, the logical conclusion is that UTF-8 needs to
be used instead of a code page. Unfortunately, Windows has problems with
that. Try MODE CON: CP SELECT=65001. Much of it works, but batch files don't
run.

Now suppose Windows does work correctly with the code page set to UTF-8. You
create an application that reads stdin, counts the words longer than 10
code points and passes the input unmodified to stdout. What happens:
* set CP to Latin 1, process Latin 1: correct result
* set CP to Latin 1, process UTF-8: wrong result
* set CP to UTF-8, process UTF-8: correct result
* set CP to UTF-8, process Latin 1: wrong result, corrupted output
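
The four cases above can be sketched in Python. This is only an illustration
under stated assumptions: count_long_words is a hypothetical stand-in for the
application, "words" are whitespace-delimited tokens, and invalid bytes are
replaced with U+FFFD rather than rejected. The sample text is chosen so that
the correct answer is 1.

```python
from typing import Tuple


# Hypothetical filter: decode stdin bytes with the active code page, count
# words longer than `limit` code points, re-encode with the same code page.
def count_long_words(raw: bytes, code_page: str, limit: int = 10) -> Tuple[int, bytes]:
    text = raw.decode(code_page, errors="replace")  # assumed lenient policy
    count = sum(1 for word in text.split() if len(word) > limit)
    return count, text.encode(code_page, errors="replace")


# Two French words written with escapes: a guillemet-quoted "interesse"
# (11 code points, two e-acute) and "evenements" (10 code points, two e-acute).
SAMPLE = "\u00abint\u00e9ress\u00e9\u00bb \u00e9v\u00e9nements"
as_latin1 = SAMPLE.encode("latin-1")
as_utf8 = SAMPLE.encode("utf-8")

for code_page in ("latin-1", "utf-8"):
    for label, data in (("Latin 1", as_latin1), ("UTF-8", as_utf8)):
        count, out = count_long_words(data, code_page)
        status = "intact" if out == data else "corrupted"
        print(f"CP={code_page:8} data={label:8} count={count} output={status}")
```

With Latin 1 as the code page every byte decodes to some character, so the
bytes always round-trip even when the count is wrong. With UTF-8 as the code
page, invalid input has no faithful representation, so the replacement
characters end up in the output as well.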

Now, I wonder why Windows is not supporting UTF-8 as much as one would
want.....

Lars

This archive was generated by hypermail 2.1.5 : Sat Dec 11 2004 - 05:36:05 CST