Re: Names for UTF-8 with and without BOM

From: Michael (michka) Kaplan (michka@trigeminal.com)
Date: Sun Nov 03 2002 - 16:02:59 EST

    From: "Mark Davis" <mark.davis@jtcsv.com>

    Ironic that for the purpose of dealing with THREE bytes that so many bytes
    are being wasted. :-)

    > Little probability that right double quote would appear at the start of a
    > document either. Doesn't mean that you are free to delete it (*and* say
    > that you are not modifying the contents).

    Interesting strawman there, Mark -- but there is a huge difference there.
    But even if we leave in the notion of it as a character and just deprecate
    its usage and people ignore that, then we are talking about a ZERO WIDTH
    NO-BREAK SPACE. This character has the job of:

    1) being invisible
    2) not allowing a line break where it appears

    So even if it were in there, who cares? I mean, can anyone explain why it
    would make a difference?

    The one thing that no one has ever come up with is a reasonable case where
    it would be at the beginning of the document *yet* it was not a BOM.

    So we have a clear semantic for it at the beginning of a file -- it's a BOM.
    Period.
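
    The rule above can be sketched in a few lines of Python. This is only an
    illustration, not anything from the standard: the function name
    split_utf8_bom is hypothetical, and the sketch assumes the bytes passed in
    really are the start of the file. A U+FEFF anywhere else is left alone.

```python
import codecs
from typing import Tuple


def split_utf8_bom(data: bytes) -> Tuple[bool, bytes]:
    """Return (had_bom, payload) for bytes read from the start of a file.

    Hypothetical helper: a leading EF BB BF is always treated as a BOM.
    """
    if data.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


had_bom, payload = split_utf8_bom(b"\xef\xbb\xbfhello")
print(had_bom, payload.decode("utf-8"))
```

    Python's built-in "utf-8-sig" codec applies the same rule when decoding,
    so data.decode("utf-8-sig") is the shorter route when only the text is
    needed.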

    If there is a higher level protocol as well and the protocol and the BOM
    both match, then that is great! Considering how much redundancy there is in
    the Unicode standard about some definitions, a redundant marker for a file
    seems a very trivial issue.

    If there is a higher level protocol as well and they do not match, then we
    are in fantasy land bizarro world, inventing edge cases because we have
    nothing better to do. :-) But for the sake of argument, let's pretend it's a
    real scenario -- in which case we treat it the same way as if your higher
    level protocol claims it's ISO-8859-1 and the BOM says it's UTF-32. It's an
    error.
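
    A minimal sketch of that check, assuming the higher level protocol hands
    us an encoding label as a string. The names sniff_bom,
    check_declared_encoding and EncodingMismatch are hypothetical, and the
    comparison is deliberately coarse: it only checks the encoding family
    (UTF-8, UTF-16, UTF-32), not byte order.

```python
import codecs
from typing import Optional

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so UTF-32 must be tested before UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


class EncodingMismatch(ValueError):
    """Raised when the BOM and the declared encoding disagree."""


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding family named by a leading BOM, or None."""
    for bom, family in _BOMS:
        if data.startswith(bom):
            return family
    return None


def check_declared_encoding(data: bytes, declared: str) -> None:
    """Raise EncodingMismatch if a BOM is present and contradicts `declared`."""
    family = sniff_bom(data)
    if family is None:
        return  # no BOM, so nothing to contradict
    name = codecs.lookup(declared).name  # normalizes labels such as "UTF8"
    if not name.startswith(family):
        raise EncodingMismatch(
            f"protocol says {declared!r}, but the BOM says {family}"
        )
```

    With this sketch, a UTF-32 BOM combined with a declared ISO-8859-1 raises
    EncodingMismatch, while a matching BOM or no BOM at all passes silently.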

    Problem solved!

    > I agree that when the UTC decides that a BOM is *only* to be used as a
    > signature, and that it would be ok to delete it anywhere in a document
    > (like a non-character), then we are in much better shape. This was, as a
    > matter of fact proposed for 3.2, but not approved. If we did that for 4.0,
    > then there would be much less reason to distinguish UTF-8 'withBOM' from
    > UTF-8 'withoutBOM'.

    There is no reason to worry about this case and no need to delete anything.
    This is a ZERO WIDTH NO-BREAK SPACE we are talking about. The burden is on
    the people who think this is a scenario to bring proof that anyone is doing
    anything as unrealistic as this.

    There is an easy, clear, and unambiguous plan that can be used here which
    will always work. For once, let's not opt to complicate it without reason.

    MichKa


