RE: UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Thu Jan 20 2005 - 09:29:15 CST


    Philippe Verdy replied:
    >
    > > * On UNIX, files are LF delimited.
    > > * On UNIX, fopen is always binary, text mode is rare and even then
    > >   very close to binary.
    > > * On UNIX, text mode often doesn't apply, so UNIX is by default in
    > >   binary mode.
    > > * On UNIX, filenames are presented and treated as opaque strings.
    > > * On UNIX, filenames are therefore case sensitive.
    > > * On UNIX, if a BOM is used in UTF-8 streams, most everything
    > >   breaks.
    > > * On UNIX, command line and scripting is very strong, the
    > >   vulnerability to UTF-8 BOM is very high.
    >
    > All these are Unix deficiencies.

    And strengths as well.
    * The size of UNIX files equals the number of bytes in the string(s)
    written. Seeks are simple and efficient.
    * You can't mess up things by forgetting to set the binary mode.
    * You can have filenames in more than one encoding on a system. A deficiency
    now, but very useful at the time.
    * Case sensitivity can be ambiguous in different languages, can it not? Even
    more so when an old system is fed with Unicode characters introduced later
    on.
    * UNIX does not drop data when an odd encoding creeps in. At least not until
    processing it heavily.
    And I hope strong scripting is not a deficiency.
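
    To make the BOM point from the quoted list concrete, here is a minimal
    Python sketch. It assumes only that tools treat files as plain byte
    streams: a loader that looks for "#!" in the first two bytes, and a
    cat-style byte-wise concatenation. The names has_shebang and cat are
    made up for illustration.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

script = b"#!/bin/sh\necho hello\n"
with_bom = BOM + script


def has_shebang(data: bytes) -> bool:
    # A script loader typically looks for "#!" in the first two bytes.
    return data[:2] == b"#!"


def cat(*parts: bytes) -> bytes:
    # Byte-wise concatenation, with no knowledge of encodings or BOMs.
    return b"".join(parts)


# Two BOM-prefixed "files" joined together: the second BOM ends up
# in the middle of the stream, glued to the start of a line.
joined = cat(BOM + b"first\n", BOM + b"second\n")
lines = joined.decode("utf-8").split("\n")

print(has_shebang(script), has_shebang(with_bom))
print([ascii(line) for line in lines])
```

    With the BOM in front, the first two bytes are no longer "#!", and after
    concatenation the second line compares unequal to "second" because it
    starts with U+FEFF.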

    > If there had been such simple attribute, the filesystems could have
    > been tuned up to support some transparent transformation, such as the
    > elimination of CRLF/CR/LF conventions, the transparent filtering of
    > BOMs according to the application preference.

    The filesystem seems like a central place to address it, yet this is a
    simple but wrong solution. Pipes need the meta tag as well, so it is the
    run-time library that needs to address it. Hence the interoperability
    problems between old and new programs.
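
    As an illustration of handling this in the run-time library rather than
    the filesystem, here is a sketch using Python's "utf-8-sig" codec, which
    drops one leading BOM if present. It works on any binary stream, pipe or
    file. The helper names read_text and legacy_read are hypothetical, and
    io.BytesIO merely stands in for a pipe.

```python
import io


def read_text(stream) -> str:
    # "New" program: the library layer strips a leading BOM, if any.
    wrapper = io.TextIOWrapper(stream, encoding="utf-8-sig")
    try:
        return wrapper.read()
    finally:
        wrapper.detach()  # leave the underlying stream open for the caller


def legacy_read(stream) -> str:
    # "Old" program: decodes the bytes as plain UTF-8 and keeps the BOM.
    return stream.read().decode("utf-8")


data = b"\xef\xbb\xbfhello\n"

print(ascii(read_text(io.BytesIO(data))))
print(ascii(legacy_read(io.BytesIO(data))))
```

    The two readers disagree about the same bytes, which is the old versus
    new interoperability problem in miniature.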

    > Those working on a revision of POSIX and ANSI C/C++ should now
    > consider the need for reliable and portable (documented and approved
    > by a formal standard) ways to support meta-data on filesystems across
    > multiple OSes. This would really help the IT community.

    Maybe. I am not sure it would work, nor that it could be implemented and
    put into action soon enough.

    UNIX has an approach which doesn't choke on mixed 8-bit data. It delays the
    problem, often until display time. Sometimes nobody cares and no action is
    required. That's user friendly. Microsoft (effectively) defines user
    friendly as proper display, at the cost of often choking early in the
    process, dropping data, popping up dialogs...
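
    A small Python sketch of the "delay it until display" approach: names are
    kept as raw bytes while being sorted and filtered, and decoding is
    attempted only when something has to be shown. The sample names and the
    helpers process and display are invented for the example.

```python
# The same name in two encodings, plus a pure-ASCII one.
names = [
    "r\u00e9sum\u00e9.txt".encode("utf-8"),
    "r\u00e9sum\u00e9.txt".encode("latin-1"),
    b"plain.txt",
]


def process(entries):
    # Filtering and sorting work on opaque bytes; no encoding is assumed.
    return sorted(e for e in entries if e.endswith(b".txt"))


def display(entry: bytes) -> str:
    # Only at display time do we guess UTF-8, marking bytes that do not fit.
    return entry.decode("utf-8", errors="replace")


for entry in process(names):
    print(ascii(display(entry)))
```

    Nothing is dropped along the way. The Latin-1 name simply shows U+FFFD
    replacement characters when displayed, and its original bytes are still
    intact.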

    Most UNIX users are used to dealing with more than one encoding, which means
    they can survive the transition. Once most of the data (and filenames) is in
    UTF-8, UNIX will be just fine with its current approach. Even better than at
    the time of numerous encodings. Except for the BOM. And, yes, it will work
    only if UTF-8 becomes the prevailing format.

    Such an approach would be very simple and efficient. Some flavors will
    definitely adopt it. We'll have to wait and see if they can get away with
    it. Oh, BTW, they will need to deal with invalid sequences. Unicode should
    provide the means for them to do so. Just like it allows the BOM in UTF-8.
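
    One existing way to carry invalid sequences through text processing is
    Python's "surrogateescape" error handler. It is a Python convention, not
    something the Unicode standard defines, so take this as a sketch of the
    kind of mechanism meant here. The helpers to_text and to_bytes are
    hypothetical names.

```python
def to_text(raw: bytes) -> str:
    # Bytes that are not valid UTF-8 become lone surrogates U+DC80..U+DCFF.
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    # The escaped surrogates are turned back into the original bytes.
    return text.encode("utf-8", errors="surrogateescape")


# A Latin-1 e-acute (E9) followed by a proper UTF-8 e-acute (C3 A9).
mixed = b"caf\xe9 / caf\xc3\xa9\n"

text = to_text(mixed)
print(ascii(text))
print(to_bytes(text) == mixed)
```

    A strict mixed.decode("utf-8") raises UnicodeDecodeError on the same
    input, whereas the escaped form round-trips byte for byte.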

    Lars


