RE: Subject: Re: 32'nd bit & UTF-8

From: Lars Kristan (lars.kristan@hermes.si)
Date: Fri Jan 21 2005 - 16:20:30 CST

    Richard Gillam wrote:

    > The current committee is EXTREMELY vigilant
    > and won't let these things happen.

    I think this vigilance (though typically warranted) has gone too far. So
    far, in fact, that even though it would be possible to prevent data loss
    when encountering invalid sequences in UTF-8, the cost of 128 codepoints
    is still considered too high. Yes, the million codepoints will suffice for
    a very long time.
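
    To make the 128-codepoint idea concrete, here is a minimal sketch. It
    does not use any codepoints assigned for this purpose (none exist); as a
    stand-in it assumes Python's "surrogateescape" error handler, which maps
    each undecodable byte 0x80..0xFF to one of the 128 lone surrogates
    U+DC80..U+DCFF and back again. The function names are hypothetical.

```python
def decode_lossless(raw: bytes) -> str:
    """Decode UTF-8, keeping invalid bytes as escape code points.

    Assumption: Python's 'surrogateescape' handler stands in for the
    128 reserved codepoints discussed above. Each invalid byte 0xNN
    becomes the lone surrogate U+DCNN instead of being dropped.
    """
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    """Encode back to bytes, restoring the original invalid bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# A file name or data line containing bytes that are not valid UTF-8.
raw = b"abc\xff\xfedef"
text = decode_lossless(raw)

# The round trip returns exactly the bytes we started with.
assert encode_lossless(text) == raw
```

    The point of the sketch is only that the round trip is lossless: the
    invalid bytes survive decoding and re-encoding unchanged.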

    > That sequence has no special meaning in UTF-8;
    > it's just a zero-width non-breaking space.

    Ummmmm, is it really? I thought it no longer was. And in this particular
    case it definitely is not. It is not a part of the text, is it? It is a
    part (well, the only part) of the document that encapsulates the text. If
    your text is a letter, you won't see the difference, but if the text is a
    batch file, then you will. Or will you say programs are not written in
    plain text?

    > So EF BB BF at the top of the UTF-8 file does occur
    > in practice and it's good for software to be aware
    > of it (but relatively harmless if it isn't).

    Again, think beyond books and letters. Think plain text databases (a list of
    files to process, if you wish), batch files (called scripts on UNIX), and a
    number of other things, like configuration files. Harmless? On the contrary.
    But we shouldn't really be talking about whether or not to have the BOM. We
    should be talking about how to make things work. Either with the BOM,
    without the BOM, or with the BOM being optional. Those are the three paths
    we could take. It is not even necessary for everybody to take the same
    path.
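
    The three paths can be sketched as three policies in one decoder. This
    is only an illustration: the function name decode_utf8 and the policy
    names "required", "forbidden" and "optional" are hypothetical, not taken
    from any standard or existing tool. It also shows why the BOM is not
    harmless for a script: with EF BB BF in front, the file no longer starts
    with "#!".

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def decode_utf8(data: bytes, bom: str = "optional") -> str:
    """Decode UTF-8 under one of three hypothetical BOM policies."""
    if bom not in ("required", "forbidden", "optional"):
        raise ValueError("unknown BOM policy: " + bom)
    has_bom = data.startswith(BOM)
    if bom == "required" and not has_bom:
        raise ValueError("UTF-8 BOM required but missing")
    if bom == "forbidden" and has_bom:
        raise ValueError("UTF-8 BOM present but not allowed")
    # The BOM belongs to the encapsulation, not the text: strip it.
    payload = data[len(BOM):] if has_bom else data
    return payload.decode("utf-8")


plain = b"#!/bin/sh\necho hello\n"
with_bom = BOM + plain

# A consumer that looks at raw bytes no longer sees the "#!" marker.
assert plain.startswith(b"#!")
assert not with_bom.startswith(b"#!")

# Under the "optional" policy both forms yield the same text.
assert decode_utf8(with_bom) == decode_utf8(plain)
```

    Whichever policy a tool picks, the check happens in one place, so
    different tools taking different paths can still exchange files as long
    as each states which policy it applies.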

    Lars


