Re: UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)

From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Fri Jan 21 2005 - 13:12:03 CST

    OK, looks like I also was too terse.

    I wrote:
    >> fopen is normally text ("w"), binary mode ("wb") is rare and even
    >> then identical to text.

    As for the first and third parts, we each know what the other means; you
    described it at length.

    When I wrote:
    >> binary mode ("wb") is rare

    I intended to highlight that the purpose of this "b" flag is often lost in
    programs of *nix heritage. And this is a problem.
    Granted, if the source never leaves the *nix world there is no actual
    problem. But problems arise as soon as you try to port it elsewhere,
    particularly when the same file is written with "w" and read with "rb" by
    the same set of programs (yes, it happens).
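
To illustrate the "w"/"rb" mismatch, here is a minimal sketch in C. The
function name text_write_binary_read and its path parameter are hypothetical,
chosen only for this example. It writes two lines in text mode and counts the
bytes seen when the file is reopened in binary mode.

```c
#include <stdio.h>

/* Write "a\nb\n" in text mode, then count the bytes seen in binary mode.
   Returns the byte count, or -1 on any I/O error.
   (Hypothetical helper, for illustration only.) */
long text_write_binary_read(const char *path)
{
    FILE *f = fopen(path, "w");          /* text mode: '\n' may be translated */
    if (f == NULL)
        return -1;
    if (fputs("a\nb\n", f) == EOF) {
        fclose(f);
        return -1;
    }
    if (fclose(f) != 0)
        return -1;

    f = fopen(path, "rb");               /* binary mode: bytes as stored */
    if (f == NULL)
        return -1;

    long count = 0;
    while (fgetc(f) != EOF)
        count++;
    fclose(f);
    return count;
}
```

On a *nix system the count is 4, since text and binary modes are identical
there. With the usual Windows C runtimes each '\n' is stored as CR LF, so the
binary reader sees 6 bytes and has to cope with the CRs itself.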

    Windows is probably a necessary evil, but that is no excuse for taking
    shortcuts and doing things in a way that only creates problems for
    others.

    Of course, this is not directed at you; there is nothing personal here.

    Lars Kristan answered:
    > BTW (yes, again and again): This is something Windows is not able
    > to achieve.

    What do you mean (simple curiosity, I did not get your point)?
    Yes, Windows does extravagant contortions with codepages for filenames (and
    this is disappearing, fortunately), but that should be irrelevant. Yes, C
    programs on Windows "eat" CR.
    But I fail to see significant examples beyond those (I consider the
    "feature" of a BOM at the beginning of Notepad/RichEdit/whatever UTF-8
    files to be an outright bug that was not caught in time and is now
    entrenched).
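
For readers who have to cope with such files, a minimal sketch in C of
detecting the UTF-8 signature (the bytes EF BB BF, i.e. U+FEFF encoded in
UTF-8) at the start of a buffer. The name utf8_bom_length is hypothetical;
the caller is assumed to skip that many bytes before processing the text.

```c
#include <stddef.h>

/* Return 3 if buf starts with the UTF-8 encoding of U+FEFF (EF BB BF),
   otherwise 0. (Hypothetical helper, for illustration only.) */
size_t utf8_bom_length(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    return 0;
}
```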

    > But that does not mean no Unicode application is able
    > to do it. Application that processes text in UTF-8 is also able to
    > do it. UTF-16 applications on the other hand are not.

    Sorry, do you mean that a Unicode application is able to "absorb" any
    stream (including erroneous encodings) when programmed in UTF-8, but not
    when programmed in UTF-16?
    I would have expected just the reverse, because a conforming UTF-8
    decoder is required to treat overlong encodings as illegal.
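
To make the overlong point concrete, here is a minimal sketch in C, not a
full validator. The name utf8_starts_overlong is hypothetical. It only looks
at the first sequence in the buffer and reports whether it is an overlong
form, that is, a multi-byte sequence for a code point that fits in a shorter
one (for example C0 80 for U+0000).

```c
#include <stddef.h>

/* Return 1 if the UTF-8 sequence starting at s is an overlong form,
   0 otherwise. Only the lead byte and the first continuation byte are
   inspected; no other validity checks are made.
   (Hypothetical helper, for illustration only.) */
int utf8_starts_overlong(const unsigned char *s, size_t len)
{
    if (len < 1)
        return 0;

    /* 2-byte forms with lead C0 or C1 encode values below U+0080. */
    if (s[0] == 0xC0 || s[0] == 0xC1)
        return 1;

    if (len < 2)
        return 0;

    /* 3-byte form: E0 followed by 80..9F encodes a value below U+0800. */
    if (s[0] == 0xE0 && (s[1] & 0xE0) == 0x80)
        return 1;

    /* 4-byte form: F0 followed by 80..8F encodes a value below U+10000. */
    if (s[0] == 0xF0 && (s[1] & 0xF0) == 0x80)
        return 1;

    return 0;
}
```

A UTF-8 program that "absorbs" everything has to decide what to do when this
returns 1, which is why I would expect the requirement to weigh on UTF-8
processing rather than on UTF-16.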

    Antoine


