Names for UTF-8 with and without BOM

From: Joseph Boyle (Boyle@siebel.com)
Date: Fri Nov 01 2002 - 14:47:59 EST

    It would be useful to have official names to distinguish UTF-8 with and
    without BOM (or three names: with, without, and agnostic). Here are a couple
    of examples I'm currently involved with:

    * I'm writing an encoding checker to validate a long list of text file
    formats we use internally. HTML and XML count as only one format each; most
    of the others are file formats created by one of our development groups
    without regard to encoding issues. We have now tried to standardize these on
    UTF-8 with BOM, which distinguishes them from ASCII or codepage legacy files
    while still allowing legacy files to work. In the list of file formats, the
    encoding constraint field needs to distinguish UTF-8 with BOM from UTF-8
    without BOM.
    * We need an encoding conversion tool for text files that can output both
    UTF-8 with BOM and UTF-8 without BOM. Current tools like ICU's uconv do not
    support output of UTF-8 with BOM. It would be possible to add an input
    switch for BOM/no BOM distinct from the output charset specifier, but this
    is an ugly solution, as no other encoding needs it, not even UTF-16 and
    UTF-32, which have separate charset names for the with-BOM and without-BOM
    variants. I've discussed this with Markus Scherer, who would also prefer
    distinct charset names as the means to distinguish BOM and no-BOM.
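
The two use cases above (a checker that tells UTF-8 with BOM apart from UTF-8 without BOM and from legacy files, and a converter that can emit either form) might be sketched as follows. This is only a minimal illustration: the function names `classify` and `convert` and the three label strings are hypothetical, and it is not how uconv or any existing tool works.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def classify(data: bytes) -> str:
    """Return a (hypothetical) label describing how `data` is encoded."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf8"  # most likely a legacy codepage file
    if data.startswith(BOM):
        return "utf8-with-bom"
    return "utf8-without-bom"  # note: pure ASCII files also land here


def convert(data: bytes, source_encoding: str, with_bom: bool) -> bytes:
    """Re-encode `data` as UTF-8, adding or omitting the BOM as requested."""
    text = data.decode(source_encoding)
    # Drop a leading U+FEFF so the output never carries a doubled BOM.
    if text.startswith("\ufeff"):
        text = text[1:]
    payload = text.encode("utf-8")
    return BOM + payload if with_bom else payload
```

Because a pure ASCII file is also valid UTF-8 without BOM, a checker of this kind cannot separate those two cases by content alone, which is the reason given above for standardizing on UTF-8 with BOM.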

    Mark Davis introduced UTF-8N for UTF-8 with no BOM a couple of years ago,
    and it seems to have some currency, especially on Japanese sites for some
    reason. This is the only convention I can find, and I might adopt it if
    nothing else is available. However, it does not seem to have any official
    status with the Unicode Consortium or the IETF, and while making UTF-8 mean
    with-BOM would be convenient enough for us internally, I am sure some other
    users would object strongly.

    How about if we let UTF-8 keep its current status as neither requiring nor
    forbidding a BOM, make UTF-8N official for no-BOM, and coin another name for
    with-BOM? Let's call it UTF-8BOM for the moment. Behavior for each would be:

                UTF-8BOM      UTF-8N         UTF-8
    Producers   Produce BOM   Don't produce  Optional (higher protocols using
                                             UTF-8 can recommend)
    Consumers   Consume BOM   Don't consume  Should probably strip BOM since
                                             initial ZWNBSP not likely
    Checkers    Require BOM   Forbid         Optional (higher protocols using
                                             UTF-8 can forbid or require)
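
The behaviors in the table could be sketched roughly as below. The labels "UTF-8BOM" and "UTF-8N" are the names proposed in this message, not registered charsets, and the function names `produce`, `consume`, and `check` are hypothetical. The sketch relies on Python's standard `utf-8-sig` codec, which strips a leading BOM on decoding if one is present.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def produce(text: str, label: str) -> bytes:
    """Encode `text` according to the proposed label."""
    payload = text.encode("utf-8")
    if label == "UTF-8BOM":
        return BOM + payload  # always produce a BOM
    if label in ("UTF-8N", "UTF-8"):
        return payload  # plain UTF-8 leaves it optional; this sketch omits it
    raise ValueError(f"unknown label: {label}")


def consume(data: bytes, label: str) -> str:
    """Decode `data` according to the proposed label."""
    if label == "UTF-8N":
        # Do not consume: a leading U+FEFF stays in the text as ZWNBSP.
        return data.decode("utf-8")
    if label in ("UTF-8BOM", "UTF-8"):
        # Strip a leading BOM if present (enforcement is left to check()).
        return data.decode("utf-8-sig")
    raise ValueError(f"unknown label: {label}")


def check(data: bytes, label: str) -> bool:
    """Return True if `data` satisfies the label's BOM constraint."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not valid UTF-8 under any of the three labels
    has_bom = data.startswith(BOM)
    if label == "UTF-8BOM":
        return has_bom  # require
    if label == "UTF-8N":
        return not has_bom  # forbid
    if label == "UTF-8":
        return True  # optional
    raise ValueError(f"unknown label: {label}")
```

For plain "UTF-8" the producer here simply omits the BOM; a higher-level protocol could just as well decide the other way, as the table allows.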

    (I realize "UTF-8 Byte Order Mark" is an oxymoron; however, "BOM" is
    established and shorter to type than "signature", and it does not cause
    confusion unless you run into release managers talking about Bills Of
    Materials. Perhaps it is time to think of three other words starting with
    B, O, M that make a better explanation.)


