RE: 32'nd bit & UTF-8

From: Lars Kristan (lars.kristan@hermes.si)
Date: Thu Jan 20 2005 - 09:36:08 CST

    Arcane Jill wrote:
    > > You are drawing this analogue too far, because it is fairly easy to
    > > fix the \r\n problem, whereas the BOM problem runs deeper. The latter
    > > changes the very paradigm for file representation.
    >
    > I don't see why. What is the difference between discarding U+000Ds and
    > discarding U+FEFFs?

    There is some difference. You can concatenate two files containing U+000Ds
    blindly. You shouldn't do that with files that have leading U+FEFFs. Also, a
    text-processing program can drop the U+000Ds quite safely, knowing exactly
    what they represent. Dropping three consecutive bytes (EF BB BF, the UTF-8
    form of U+FEFF) is another story, especially since at the time you process
    them you might not even know whether this is the beginning of a file or not
    (say, when processing the output of a grep command).
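
    A minimal Python sketch of the concatenation point, for illustration only.
    The helper names concat_blind and concat_bom_aware are hypothetical and do
    not come from any tool mentioned in this thread. It assumes both inputs are
    UTF-8 files that start with a BOM and use CRLF line ends.

```python
import codecs

BOM = codecs.BOM_UTF8  # b'\xef\xbb\xbf', the UTF-8 encoding of U+FEFF


def concat_blind(parts):
    """Byte-level concatenation, as `cat a b` would do (hypothetical helper)."""
    return b"".join(parts)


def concat_bom_aware(parts):
    """Keep the first part as-is; drop a leading BOM from every later part."""
    out = []
    for i, part in enumerate(parts):
        if i > 0 and part.startswith(BOM):
            part = part[len(BOM):]
        out.append(part)
    return b"".join(out)


a = BOM + "first\r\n".encode("utf-8")
b = BOM + "second\r\n".encode("utf-8")

# Decoding with "utf-8" (not "utf-8-sig") keeps any BOM as U+FEFF in the text.
blind = concat_blind([a, b]).decode("utf-8")
aware = concat_bom_aware([a, b]).decode("utf-8")

# Blind concatenation leaves a U+FEFF stranded in the middle of the text.
stray_bom_inside = "\ufeff" in blind[1:]

# The CRs, by contrast, can be dropped anywhere without knowing file boundaries.
without_cr = blind.replace("\r", "")
```

    Note that concat_bom_aware only works because it still sees the file
    boundaries. A filter further down a pipeline no longer has that
    information, which is the difficulty described above.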

    The analogy between CRLF and BOM is just in the location where it needs to
    be fixed. Probably. More or less. But fixing CRLF is easier. And often you
    can get away with not fixing it at all.

    Lars


