Re: HTML5 encodings

From: Doug Ewell (doug@ewellic.org)
Date: Fri Jan 01 2010 - 11:17:15 CST


    Happy New Year to all.

    "verdy_p" <verdy underscore p at wanadoo dot fr> wrote:

    >> Unicode, and even ASCII, contains plenty of seldom-used control
    >> characters, with defined semantics if that is desirable, which an
    >> internal process can safely insert, use, and remove for purposes like
    >> this.
    >
    > No, you're wrong, there's no such character. If it existed, then this
    > character would also have a use within normal strings that would be
    > part of a primary key, and that would break the logic. If it is
    > "seldom used", it does not qualify, because it will conflict with that
    > infrequent use, so it will unavoidably be UNUSABLE to insert/use/remove
    > for such a purpose.

    If you are concerned that every possible control character, like U+009C
    STRING TERMINATOR or U+0081 <I don't have a name because nobody uses
    me>, might appear in the real text, then yes, this is a problem.
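
    As a rough illustration of the approach quoted above (an internal
    process inserting, using, and removing a control character), here is a
    minimal Python sketch. The choice of U+001F and the names SEP,
    join_fields, and split_fields are assumptions made for the example and
    do not come from this thread. The check in join_fields covers exactly
    the case where the separator does turn up in the real text.

```python
# Assumed separator: U+001F INFORMATION SEPARATOR ONE (unit separator).
SEP = "\u001f"

def join_fields(fields):
    """Join fields with SEP, refusing any field that already contains it."""
    for field in fields:
        if SEP in field:
            raise ValueError(f"field contains U+{ord(SEP):04X}: {field!r}")
    return SEP.join(fields)

def split_fields(record):
    """Split a joined record back into fields, removing the separator."""
    return record.split(SEP)
```

    If join_fields raises, the data really does use that character, and a
    different separator (or an escaping scheme) would be needed.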

    > The BOCU-1 RESET code is NOT a character, and what I wrote was exactly
    > the kind of use where it can be beneficial, because BOCU-1 was
    > designed with the express purpose of being a binary-ordered encoding
    > suitable for collation according to code points' scalar values.

    Right, I'm aware the reset byte is not a character.

    > I DID NOT say that a RESET code needed to be inserted in the
    > plain text, but its insertion with a collation key as a key separator
    > DOES NOT violate the rule, as we can completely guarantee that it will:
    > - never be present in encoded plain texts
    > - always sort AFTER any valid Unicode character
    > - not be ignored.

    If you want to use a mechanism that is internal to BOCU-1 to serve a
    metadata purpose, be my guest. You will not be able to convert your
    data to any other encoding and still retain this metadata. If that is
    not a problem for you, great.

    Hopefully you read what I wrote about UTF-8 and tag characters, or
    remember when it happened. It is a valuable lesson.

    > And I still maintain that the special RESET code in BOCU-1 should NEVER
    > be present in any encoded plain text (as effectively it has the
    > potential of creating multiple distinct encodings for equivalent
    > texts).

    This is not an absolute rule of BOCU-1, and the authors indicate how it
    could be useful for concatenating strings, which seems to me a more
    common scenario than sorting multi-column text in BOCU-1 using only the
    untailored UCA.

    > So it does not absolutely need a leading BOM

    With its lack of transparency with ASCII or any other encoding, I can
    hardly think of an encoding that is more in need of a BOM than BOCU-1.
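
    To make the point about signatures concrete, here is a minimal Python
    sketch that sniffs the encoded form of U+FEFF at the start of a byte
    stream. The byte values are the commonly documented signatures as I
    understand them (FB EE 28 for BOCU-1 is the value given in UTN #6); the
    names SIGNATURES and sniff_signature are made up for the example.

```python
# Encoded forms of U+FEFF, longest first so that the UTF-32 signatures
# are tested before the UTF-16 ones they overlap with.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfb\xee\x28", "BOCU-1"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]

def sniff_signature(data):
    """Return the encoding name implied by a leading signature, or None."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```

    This is only a heuristic: UTF-16LE text that begins with a BOM followed
    by U+0000 would be reported as UTF-32LE, and data with no signature
    yields None, which for BOCU-1 leaves very little else to go on.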

    > (My opinion is that, for interchange purposes, BOMs should be allowed
    > in ALL encodings if they can represent the U+FEFF codepoint,

    No argument there.

    > and that this codepoint should also exclusively represent a BOM and no
    > ZWNBSP semantic

    Too late; U+FEFF nominally still has both semantics, but see below.

    > if needed one could replace all ZWNBSP by ZWJ, making sure that all
    > final renderers will be able to render it).

    This is a hack. Developers of renderers should make ZWNBSP display
    correctly. It's not that hard. Creators of documents shouldn't have to
    modify their text to appease the renderer. And remember, it's
    default-ignorable.

    > All the legacy problems about the BOM would have been much simpler if
    > it had been mapped to a non-character (exactly like U+FFFE)
    > instead of a legacy format control character (like U+FEFF), but now it
    > is too late to change it or recommend some other codepoint.

    U+2060 WORD JOINER is recommended for the ZWNBSP semantic. And
    honestly, when was the last time you saw U+FEFF in real-world text (not
    in a test case) used with the ZWNBSP semantic?
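
    For anyone who does have such text, a minimal Python sketch of the
    migration might look like the following. It assumes that a single
    leading U+FEFF is a signature and can be dropped, and that every other
    U+FEFF was meant as ZWNBSP; the function name normalize_feff is
    invented for the example.

```python
BOM = "\ufeff"          # U+FEFF, BOM / ZERO WIDTH NO-BREAK SPACE
WORD_JOINER = "\u2060"  # U+2060 WORD JOINER

def normalize_feff(text):
    """Drop one leading U+FEFF and map any remaining ones to U+2060."""
    if text.startswith(BOM):
        text = text[1:]  # assumed to be a signature, not content
    return text.replace(BOM, WORD_JOINER)
```

    The first assumption is the weak one: in a string that was split out
    of the middle of a larger text, an initial U+FEFF may not be a BOM.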

    --
    Doug Ewell  |  Thornton, Colorado, USA  |  http://www.ewellic.org
    RFC 5645, 4645, UTN #14  |  ietf-languages @ http://is.gd/2kf0s
    

