Re: Names for UTF-8 with and without BOM - pragmatic

From: Markus Scherer (
Date: Wed Nov 06 2002 - 12:47:43 EST

    Lars Kristan wrote:
    > Markus Scherer wrote:
    >>If software claims that it does not modify the contents of a
    >>document *except* for initial U+FEFF
    >>then it can do with initial U+FEFF what it wants. If the
    >>whole discussion hinges on what is allowed
    >>*if software claims to not modify text* then one need
    >>not claim that so absolutely.
    > That seems pretty straightforward, but only as long as your "software" is an
    > editor and your "document" is a single file. How about a case where
    > "software" is a copy or cat command, and instead of a document you have
    > several (plain?) text files that you concat? What does "initial" mean here?

    Initial for each piece, as each is assumed to be a complete text file before concatenation. Nothing
    prevents copy/cp/cat and other commands from recognizing Unicode signatures, as long as they
    don't claim to preserve initial U+FEFF.
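
    A minimal sketch of such per-piece handling, assuming UTF-8 input held in memory as bytes. The
    names strip_utf8_signature and concat_utf8 are hypothetical and not taken from any existing tool:

```python
import codecs

UTF8_SIG = codecs.BOM_UTF8  # the three bytes EF BB BF


def strip_utf8_signature(data: bytes) -> bytes:
    """Drop one leading UTF-8 signature, if present (hypothetical helper)."""
    return data[len(UTF8_SIG):] if data.startswith(UTF8_SIG) else data


def concat_utf8(pieces, keep_signature=False):
    """Concatenate complete UTF-8 text files given as bytes.

    Only the initial U+FEFF of each piece is treated as a signature.
    Any U+FEFF elsewhere in a piece is left untouched.
    """
    body = b"".join(strip_utf8_signature(p) for p in pieces)
    return UTF8_SIG + body if keep_signature else body
```

    With keep_signature=True the result carries exactly one signature at the very start, whatever
    the pieces had.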

    > What happens next is: some software lets an initial BOM get through and
    > appends such string to a file or a stream. If other software treats it as a
    > character, the data has been modified. On the other hand, if we want to
    > allow software to disregard BOMs in the middle of character streams then we
    > have another set of security issues. And not removing is equally bad because
    > of many consequences (in the end, we could end up with every character being
    > preceded by a BOM).

    All true, and all well known, and the reason why the UTC and WG2 added U+2060 Word Joiner. This
    becomes less of an issue if and when they decide to remove/deprecate the ZWNBSP semantics from U+FEFF.

    However, in a situation where you cannot be sure about the intended purpose of an initial U+FEFF, I
    do not think that the "pragmatic" approach is any less safe than any other, while it increases usability.

    >>.txt    UTF-8    require    We want plain text files to
    >>                            have BOM to distinguish
    >>                            from legacy codepage files
    > Hmmmm, what does "plain" mean?! ...

    Your response to this takes it out of context. I am not trying to prescribe general semantics of
    .txt plain text files.

    If you read the thread carefully, you will see that I am just taking the file checker configuration
    file from Joseph Boyle and suggesting a modification to its format that makes it not rely on having
    charset names that indicate any particular BOM handling. I am sorry to not have made this clearer.

    > True, UTF-16 files do need a signature. Well, we just need to abandon the
    > idea that UTF-16 can be used for plain text files. Let's have plain text
    > files in UTF-8. Look at it as the most universal code page. Plain text files
    > never contained information about the code page, why would there be such
    > information in UTF-8 plain text files?!

    UTF-16 files do not *need* a signature per se. However, it is very useful to prepend Unicode plain
    text *files* with Unicode signatures so that tools have a chance to figure out if those files are in
    Unicode at all - and which Unicode charset - or in some legacy charset. By "plain text files" I
    mean plain text documents without any markup or other meta information.
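
    As a rough illustration of how a tool might use a signature, here is a sketch that checks the
    first bytes of a file against the UTF-8 and UTF-16 signatures only. The function name
    sniff_signature and its return shape are assumptions made for this example. UTF-32 is
    deliberately left out, so a UTF-32LE file would be misreported as UTF-16LE:

```python
import codecs

# Signatures for the encodings discussed here (UTF-8 and UTF-16 only).
_SIGNATURES = [
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]


def sniff_signature(data: bytes):
    """Return (charset name, signature length) for a leading signature.

    Returns (None, 0) when no signature is found; the caller then has to
    fall back to a legacy charset or some other heuristic.
    """
    for sig, name in _SIGNATURES:
        if data.startswith(sig):
            return name, len(sig)
    return None, 0
```

    A file without a signature is not necessarily non-Unicode; the sketch only reports what the
    first bytes say.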

    The fact is that Windows uses UTF-8 and UTF-16 plain text files with signatures (BOMs) very simply,
    gracefully, and successfully. It has applied what I called the "pragmatic" approach here for about
    10 years. It just works.


    Opinions expressed here may not reflect my company's positions unless otherwise noted.
