RE: Names for UTF-8 with and without BOM - pragmatic

From: Lars Kristan (lars.kristan@hermes.si)
Date: Wed Nov 06 2002 - 05:49:01 EST

Next message: William Overington: "Re: ct, fj and blackletter ligatures"

Previous message: Marco Cimarosti: "RE: Unicodes For Devanagari: Magic The Gathering Card"
Next in thread: Marco Cimarosti: "RE: Names for UTF-8 with and without BOM - pragmatic"
Maybe reply: Marco Cimarosti: "RE: Names for UTF-8 with and without BOM - pragmatic"
Reply: Kent Karlsson: "RE: Names for UTF-8 with and without BOM - pragmatic"
Reply: Markus Scherer: "Re: Names for UTF-8 with and without BOM - pragmatic"
Maybe reply: Joseph Boyle: "RE: Names for UTF-8 with and without BOM - pragmatic"

Markus Scherer wrote:
> If software claims that it does not modify the contents of a document
> *except* for initial U+FEFF then it can do with initial U+FEFF what it
> wants. If the whole discussion hinges on what is allowed *if software
> claims to not modify text* then one need not claim that so absolutely.

That seems pretty straightforward, but only as long as your "software" is an
editor and your "document" is a single file. How about a case where the
"software" is a copy or cat command, and instead of a document you have
several (plain?) text files that you concatenate? What does "initial" mean
here?

What happens next is this: some software lets an initial BOM through and
appends such a string to a file or a stream. If other software then treats
that BOM as a character, the data has been modified. On the other hand, if we
allow software to disregard BOMs in the middle of character streams, then we
have another set of security issues. And not removing them is equally bad
because of the many consequences (in the end, we could end up with every
character being preceded by a BOM).
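
A minimal Python sketch of the concatenation case, to make the question about
"initial" concrete. The cat() helper and the two in-memory "files" are
hypothetical stand-ins for a byte-wise cat command and real files; the only
assumption is that both inputs were written with a UTF-8 signature.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def cat(*chunks: bytes) -> bytes:
    """Byte-wise concatenation, as a cat command would do (hypothetical helper)."""
    return b"".join(chunks)


# Two hypothetical plain text files, each written with a UTF-8 signature.
file_a = BOM + "first\n".encode("utf-8")
file_b = BOM + "second\n".encode("utf-8")

joined = cat(file_a, file_b)

# A reader that strips only the initial signature still sees the second
# one, now as an ordinary U+FEFF character in the middle of the text.
text = joined.decode("utf-8-sig")
print(repr(text))
```

The second file's signature was "initial" in its own file, but after the
concatenation it is data in the middle of the stream.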

> .txt   UTF-8   require   We want plain text files to have BOM to
>                          distinguish from legacy codepage files

Hmmmm, what does "plain" mean?! Perhaps files with a BOM should be called
"text" files (or .txt files;) as opposed to "plain text" files, which in my
opinion should be just that - _plain_ text. No ASCII plain text file had an
ASCII signature. I believe "plain text" should be something that will be as
easy to use (and handle) as ASCII plain text files were.

True, UTF-16 files do need a signature. Well, we just need to abandon the
idea that UTF-16 can be used for plain text files. Let's have plain text
files in UTF-8. Look at it as the most universal code page. Plain text files
never contained information about the code page, why would there be such
information in UTF-8 plain text files?!
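
To illustrate why UTF-16 needs the signature while UTF-8 does not, here is a
small Python sketch. It only uses the standard codecs; the sample character is
an arbitrary choice for illustration.

```python
import codecs

# The same character, serialized in the two possible UTF-16 byte orders.
data_le = "A".encode("utf-16-le")  # bytes 41 00
data_be = "A".encode("utf-16-be")  # bytes 00 41

# Without a signature the reader has to guess the byte order.
# Guessing wrong silently yields a different, valid character.
misread = data_le.decode("utf-16-be")

# With a signature the generic "utf-16" codec picks the right order.
from_le = (codecs.BOM_UTF16_LE + data_le).decode("utf-16")
from_be = (codecs.BOM_UTF16_BE + data_be).decode("utf-16")

# UTF-8 has a single byte order, so there is nothing to guess.
data_utf8 = "A".encode("utf-8")

print(repr(misread), from_le, from_be, data_utf8)
```

In UTF-8 the signature carries no byte-order information at all; it can only
say "this is UTF-8".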

How about this:
* BOM makes a file stateful.
* Plain text should NOT be stateful (or, at least, we should make it as
stateless as possible).
* If a text file is stateful, it is no longer a "plain text file", it
becomes a "text document".

BTW, since I may be tempted to process text documents with plain text tools,
I would rather that text documents did NOT have the BOM (yes, that
effectively makes them plain text files). Since it seems that many people
will insist on the option to have the BOM in text documents, it seems that
it will need to be allowed. But I would not make it "required".
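
"Allowed but not required" could look roughly like this on the reading side.
This is only a sketch: read_text() and to_plain() are hypothetical helper
names, and the sketch deliberately touches nothing but a signature at the very
start of the data.

```python
import codecs


def read_text(raw: bytes) -> str:
    """Decode UTF-8, accepting but not requiring an initial signature."""
    return raw.decode("utf-8-sig")


def to_plain(raw: bytes) -> bytes:
    """Drop an initial UTF-8 signature and leave every other byte intact."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


with_bom = codecs.BOM_UTF8 + b"hello\n"
without_bom = b"hello\n"

# Both variants decode to the same text.
print(read_text(with_bom) == read_text(without_bom))

# After to_plain() the result can be handed to plain text tools.
print(to_plain(with_bom) == without_bom)
```

A file without the signature passes through to_plain() unchanged, so plain
text files pay nothing for the option.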

Lars Kristan


This archive was generated by hypermail 2.1.5 : Wed Nov 06 2002 - 06:23:48 EST