RE: UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Jan 22 2005 - 09:40:39 CST

Next message: Lars Kristan: "RE: Conformance (was UTF, BOM, etc)"

Previous message: Jon Hanna: "RE: wchar_t (was RE: 32'nd bit & UTF-8)"
Maybe in reply to: Lars Kristan: "UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)"
Next in thread: Lars Kristan: "RE: UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)"

Antoine Leca wrote:

> But it is when you try to port it outside that problems
> surge. Particularly
> when the same file is written "w" and read "rb", by the same
> set of programs
> (yes, it happens).

How about using _fmode? Whoever can't display the result (Notepad?) should
be blamed. If you can't read others' files, you should consider fixing that
anyway, since you might as well encounter such files on UNIX. But, yes, I
know it's not always as simple as that. Still, if most of those files are
internal, you might get results quicker.
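
The "w" versus "rb" mismatch quoted above can be reproduced on any platform. This is only a rough sketch: the helper name write_text_read_binary is made up for illustration, and Python's newline argument stands in for what C text mode does on Windows.

```python
import os
import tempfile


def write_text_read_binary(text: str, newline: str) -> bytes:
    """Write `text` in text mode, then read the same file back in binary mode.

    `newline` selects the translation applied to "\n" on output;
    "\r\n" imitates Windows text mode, "\n" imitates UNIX.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8", newline=newline) as f:
            f.write(text)
        # Binary mode performs no translation, so the reader sees raw bytes.
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


windows_style = write_text_read_binary("one\ntwo\n", "\r\n")
unix_style = write_text_read_binary("one\ntwo\n", "\n")
print(windows_style)  # the CRs are now part of what the binary reader gets
print(unix_style)
```

A reader that opens the file in binary mode has to cope with both results, which is the same work it would need to accept files coming from the other platform.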

> Windows is probably a necessary evil, but this is not an
> excuse to take the
> short track and do the things in a way that only raise problems to the
> others.

UNIX causes problems to Windows, Windows causes problems to UNIX. And the
short track is the quickest. You can choose the other track. It typically
takes longer, maybe because it swaps.

But on a more serious note - defining BOM to be quasi harmless so it can
sometimes be considered to be part of the text is also the short track.
Those who use it that way are doing things in a way that only raises
problems to the others.

UNIX gracefully processes binary data as text. Some call it processing
garbage. Well, as UTF-8 prevails, the garbage will gradually be gone. I
think it is not that bad to process garbage for a while - better than make
the effort to clean and label all garbage, only to discard most of it soon
thereafter.

Windows chooses BOM. If UNIX data is garbage, BOM is clutter. Just like the
CRs. Processing clutter is not cheap. Hence, Windows swaps.
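
A small sketch of why the BOM behaves like clutter once it is treated as part of the text. It assumes Python's "utf-8" and "utf-8-sig" codecs as stand-ins for a UNIX-style and a Notepad-style reader; neither is what the tools discussed here actually run.

```python
import codecs

# A file as a BOM-emitting editor might save it.
payload = codecs.BOM_UTF8 + "hello".encode("utf-8")

# A plain UTF-8 reader keeps the BOM as a U+FEFF character in the text.
as_text = payload.decode("utf-8")

# A signature-aware reader strips it, but only at the very start.
as_signature = payload.decode("utf-8-sig")

# Concatenating two such files (think `cat a b`) leaves a BOM mid-stream
# that even the signature-aware reader keeps.
concatenated = (payload + payload).decode("utf-8-sig")

print(repr(as_text))
print(repr(as_signature))
print(repr(concatenated))
```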

> Of course, this is not against you, there is nothing personal here.

I wouldn't expect otherwise. I suppose it must look like I am a UNIX fan.
Well, I just sit in between and get the problems of both. And it looks like
it's UNIX that needs to be able to process invalid sequences. Actually,
Windows needs it more.

> > BTW (yes, again and again): This is something Windows is not able
> > to achieve.
>
> What do you mean (simple curiosity, I did not get your point)?
> Yes Windows does extravagant contortions about codepages with
> filenames (and
> this is disappearing, fortunately), but that should be
> irrelevant. Yes, C
> programs on Windows "eat" CR.
> But I fail to see significant examples beyond those (I
> consider the "feature"
> of BOM at the beginning of Notepad/RichEdit/whatever UTF-8
> files to be an
> outright bug that had been missed in due time and is now
> entrenched).

Not a bug. Windows relies on it. At least in Notepad. But it's true that it
is sometimes also emitted when it shouldn't be. This is because it is hard
to determine when to do what. Hard to even define, let alone determine at
run time.

As to what I meant - well the answer should be in the other posts. In short,
Windows is not able to process data in an unknown encoding. Not without a
risk of losing data. And it typically does so silently, yielding incorrect
results.

> Sorry: you mean, a Unicode application is able to "absorb" any stream
> (including erroneous encodings) when programmed in UTF-8
> while unable when
> programmed in UTF-16?
> I would have expected just the reverse (because of the
> requirements for the
> illegality of the overlong encodings).

Interesting choice of words. Absorb. If the application only consumes
(absorbs) data, then dropping unrecognised data (invalid sequences or
unassigned values in legacy encodings) often doesn't matter.

But what if you need to change the data and pass it on? Then dropping data
is a problem, even more so when it is done silently. And refusing the stream
is not always an answer, especially if you have already processed and passed
on half of it.
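
To make the "change it and pass it on" case concrete, here is a hedged sketch. The record format and the helper bump_size are invented for the example, and Python's "surrogateescape" error handler is a later mechanism used here only to show that preservation is possible; it is not something the tools discussed in this thread offer.

```python
# 0xE9 is Latin-1 "é"; on its own it is an invalid UTF-8 sequence.
raw = b"name=caf\xe9;size=10\n"


def bump_size(data: bytes, errors: str) -> bytes:
    """Decode, edit one field, and re-encode with the given error policy."""
    text = data.decode("utf-8", errors=errors)
    text = text.replace("size=10", "size=11")
    return text.encode("utf-8", errors=errors)


# Dropping invalid sequences: the edit succeeds, but a byte silently vanishes.
lossy = bump_size(raw, "ignore")

# Escaping invalid bytes: the edit succeeds and the unknown byte passes through.
preserved = bump_size(raw, "surrogateescape")

print(lossy)
print(preserved)
```

The downstream consumer of the first result has no way of knowing that anything was lost.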

Now an addition to the first scenario, absorption. Dropping data is sometimes
a problem there too. OK, Windows has a habit of really dropping invalid
sequences, meaning even a simple word count might fail. Next, you fix it and
replace invalid sequences with U+FFFD. Word count now works (at least on
reasonable data, which is what I need). What if I then choose to search for
U+FFFD? And mean U+FFFD?! It's close to searching for a '?', except I have
no way of telling that I do not want it to be treated as a wildcard.
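
Both effects can be seen in a few lines. This is a sketch with made-up sample bytes; Python's "ignore" and "replace" error handlers stand in for a converter that drops invalid sequences and one that substitutes U+FFFD.

```python
# The middle word is Latin-1 "äöü", which is not valid UTF-8.
data = b"one \xe4\xf6\xfc three"

dropped = data.decode("utf-8", errors="ignore")
replaced = data.decode("utf-8", errors="replace")

# Dropping removes the whole word; replacing keeps a placeholder for it.
words_dropped = len(dropped.split())
words_replaced = len(replaced.split())

# After replacement, a genuine U+FFFD in the input cannot be told apart
# from one that was substituted for an invalid byte.
mixed = "genuine \ufffd".encode("utf-8") + b" broken \xff"
hits = mixed.decode("utf-8", errors="replace").count("\ufffd")

print(words_dropped, words_replaced, hits)
```

A search for U+FFFD in the last string finds two matches, only one of which was in the original data.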

UTF-16 is very good at absorbing. Too good. But it's not its fault, it's the
fault of the UTF-8 to UTF-16 conversion. UTF-8 is good at preserving.

Of course, none of that applies if you consider both applications to be
conformant. If you think only of conformant applications and of what happens
when you feed them non-conformant data, then it is indeed easier to implement
validation in a UTF-16 application than in a UTF-8 application. The problem
is that UTF-8 to UTF-16 conversion implies partial validation, while UTF-16
to UTF-8 conversion does not necessarily validate anything.
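
The asymmetry can be sketched as follows. The helper is_valid_utf8 is hypothetical, and Python's "surrogatepass" handler is used only to imitate a converter that transcodes 16-bit units one at a time without checking them; Python's default codecs are stricter than that.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts `data`."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Going from UTF-8 to code points, the decoder cannot avoid noticing
# malformed input such as this overlong encoding of "/".
overlong_slash = b"\xc0\xaf"

# Going the other way, an unpaired surrogate (half of a UTF-16 pair)
# can be pushed through a non-validating converter unchanged.
lone_surrogate = "\ud800"
leaked = lone_surrogate.encode("utf-8", errors="surrogatepass")

print(is_valid_utf8(overlong_slash))  # rejected on the way in
print(leaked, is_valid_utf8(leaked))  # emitted on the way out, yet invalid
```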

Lars


This archive was generated by hypermail 2.1.5 : Sat Jan 22 2005 - 09:45:02 CST