Re: Names for UTF-8 with and without BOM

From: Michael (michka) Kaplan (michka@trigeminal.com)
Date: Sun Nov 03 2002 - 16:02:59 EST

Next message: Doug Ewell: "Re: Names for UTF-8 with and without BOM"

Previous message: Mark Davis: "Re: Names for UTF-8 with and without BOM"
In reply to: Mark Davis: "Re: Names for UTF-8 with and without BOM"
Next in thread: Mark Davis: "Re: Names for UTF-8 with and without BOM"
Reply: Mark Davis: "Re: Names for UTF-8 with and without BOM"

From: "Mark Davis" <mark.davis@jtcsv.com>

Ironic that so many bytes are being wasted for the purpose of dealing with
THREE bytes. :-)

> Little probability that right double quote would appear at the start of a
> document either. Doesn't mean that you are free to delete it (*and* say
> that you are not modifying the contents).

Interesting strawman there, Mark -- but there is a huge difference between
the two cases. Even if we leave in the notion of it as a character, just
deprecate its usage, and people ignore that, we are talking about a ZERO
WIDTH NO-BREAK SPACE. This character has the job of:

1) being invisible
2) not allowing a line break at its position

So even if it were in there, who cares? I mean, can anyone explain why it
would make a difference?

The one thing that no one has ever come up with is a reasonable case where
it would be at the beginning of the document *yet* it was not a BOM.

So we have a clear semantic for it at the beginning of a file -- it's a BOM.
Period.
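
As a rough illustration of that rule, here is a minimal Python sketch. The
helper name split_bom is made up for this example and is not something
proposed in the thread. It treats U+FEFF as a signature only when it is the
very first thing in the data, and leaves any later occurrence alone:

```python
import codecs

def split_bom(data: bytes):
    """Return (has_bom, payload) for a UTF-8 byte string.

    Hypothetical helper: a leading U+FEFF is taken to be a signature;
    any U+FEFF later in the data is left untouched.
    """
    # codecs.BOM_UTF8 is b'\xef\xbb\xbf', the UTF-8 encoding of U+FEFF.
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data
```

The "THREE bytes" under discussion are exactly EF BB BF, which is what
U+FEFF becomes when encoded as UTF-8.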

If there is a higher level protocol as well and the protocol and the BOM
both match, then that is great! Considering how much redundancy there is in
the Unicode standard about some definitions, a redundant marker for a file
seems a very trivial issue.

If there is a higher level protocol as well and they do not match, then we
are in fantasy land bizarro world, inventing edge cases because we have
nothing better to do. :-) But for the sake of argument, let's pretend it's a
real scenario -- in which case we treat it the same way as if your higher
level protocol claims it's ISO-8859-1 and the BOM says it's UTF-32. It's an
error.

Problem solved!
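
The match / mismatch cases above could be sketched like this in Python. The
function names (sniff_bom, resolve_encoding) are invented for illustration,
and the comparison is deliberately coarse: it only checks the encoding
family (utf-8, utf-16, utf-32), not byte order, so treat it as a sketch
rather than a complete detector:

```python
import codecs

# Longer signatures first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(data: bytes):
    """Return the encoding family implied by a leading BOM, or None."""
    for bom, family in _BOMS:
        if data.startswith(bom):
            return family
    return None

def resolve_encoding(data: bytes, declared=None):
    """Reconcile a protocol-declared charset with the BOM (assumed policy)."""
    sniffed = sniff_bom(data)
    if declared is None:
        return sniffed                    # BOM alone decides (may be None)
    name = codecs.lookup(declared).name   # normalised codec name
    if sniffed is None or name.startswith(sniffed):
        return name                       # no BOM, or both agree
    raise ValueError(f"protocol says {declared!r}, BOM says {sniffed!r}")
```

With this policy, a declared ISO-8859-1 on data that opens with a UTF-32
BOM raises ValueError, which is the "it's an error" case.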

> I agree that when the UTC decides that a BOM is *only* to be used as a
> signature, and that it would be ok to delete it anywhere in a document
> (like a non-character), then we are in much better shape. This was, as a
> matter of fact proposed for 3.2, but not approved. If we did that for 4.0,
> then there would be much less reason to distinguish UTF-8 'withBOM' from
> UTF-8 'withoutBOM'.

There is no reason to worry about this case and no need to delete anything.
This is a ZERO WIDTH NO-BREAK SPACE we are talking about. The burden is on
the people who think this is a real scenario to bring proof that anyone is
doing anything as unrealistic as this.

There is an easy, clear, and unambiguous plan that can be used here which
will always work. For once, let's not opt to complicate it without reason.

MichKa

This archive was generated by hypermail 2.1.5 : Sun Nov 03 2002 - 16:35:35 EST