Re: HTML5 encodings

From: Doug Ewell (doug@ewellic.org)
Date: Fri Jan 01 2010 - 11:17:15 CST

Next message: Ed Trager: "Quick Question About Korean Input Methods"

Previous message: verdy_p: "Re: HTML5 encodings"
In reply to: verdy_p: "Re: HTML5 encodings"
Next in thread: Richard Wordingham: "Re: HTML5 encodings"

Happy New Year to all.

"verdy_p" <verdy underscore p at wanadoo dot fr> wrote:

>> Unicode, and even ASCII, contains plenty of seldom-used control
>> characters, with defined semantics if that is desirable, which an
>> internal process can safely insert, use, and remove for purposes like
>> this.
>
> No, you're wrong, there's no such character. If it existed, then this
> character would also have a use within normal strings that would be
> part of a primary key, and that would break the logic. If it is
> "seldom used", it does not qualify, as it will conflict with that
> rare use, so it will unavoidably be UNUSABLE to insert/use/remove
> for such a purpose.

If you are concerned that every possible control character, like U+009C
STRING TERMINATOR or U+0081 <I don't have a name because nobody uses
me>, might appear in the real text, then yes, this is a problem.
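
As a minimal sketch of that concern, in Python: an internal process could check that its chosen control character is absent from the data before relying on it as a separator. The choice of U+001F (the C0 "unit separator") and the helper names `join_key_fields` and `split_key_fields` are assumptions made for illustration, not anything proposed in this thread.

```python
SEPARATOR = "\u001f"  # C0 unit separator; an illustrative choice only


def join_key_fields(fields, separator=SEPARATOR):
    """Join fields with a control character, refusing ambiguous input."""
    for field in fields:
        if separator in field:
            # The "seldom-used" character turned up in real text after all.
            raise ValueError(
                f"separator U+{ord(separator):04X} already occurs in {field!r}"
            )
    return separator.join(fields)


def split_key_fields(key, separator=SEPARATOR):
    """Recover the fields; unambiguous only because of the check above."""
    return key.split(separator)
```

If the check ever fails, the process has to pick a different separator or escape the data, which is exactly the case being argued about.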

> The BOCU-1 RESET code is NOT a character, and what I wrote was exactly
> the kind of use where it can be beneficial, because BOCU-1 was
> designed with the express purpose of being a binary-ordered encoding
> suitable for collation according to code point's scalar values.

Right, I'm aware the reset byte is not a character.

> I DID NOT say that a RESET code needed to be inserted in the
> plain text, but its insertion with a collation key as a key separator
> DOES NOT violate the rule, as we can fully guarantee that it will:
> - never be present in encoded plain texts
> - always sort AFTER any valid Unicode character
> - not be ignored.

If you want to use a mechanism that is internal to BOCU-1 to serve a
metadata purpose, be my guest. You will not be able to convert your
data to any other encoding and still retain this metadata. If that is
not a problem for you, great.

Hopefully you read what I wrote about UTF-8 and tag characters, or
remembered when it happened. It is a valuable lesson.

> And I still maintain that the special RESET code in BOCU-1 should NEVER
> be present in any encoded plain-text (as effectively it has the
> potential of creating multiple distinct encodings for equivalent
> texts).

This is not an absolute rule of BOCU-1, and the authors indicate how it
could be useful for concatenating strings, which seems to me a more
common scenario than sorting multi-column text in BOCU-1 using only the
untailored UCA.

> So it does not absolutely need a leading BOM

With its lack of transparency with ASCII or any other encoding, I can
hardly think of an encoding that is more in need of a BOM than BOCU-1.
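
To illustrate why a signature helps here, the following Python sketch sniffs the leading bytes of a stream. The function name `sniff_signature` is hypothetical, and the byte sequences are the encoded forms of U+FEFF as I understand them (FB EE 28 for BOCU-1); treat it as a sketch rather than a complete detector.

```python
# Longer signatures come first so UTF-32LE is not mistaken for UTF-16LE.
SIGNATURES = [
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfb\xee\x28", "BOCU-1"),
    (b"\xff\xfe", "UTF-16LE"),
    (b"\xfe\xff", "UTF-16BE"),
]


def sniff_signature(data):
    """Return the encoding named by a leading U+FEFF signature, or None."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```

Without a signature, ASCII-only text in UTF-8 can still be recognized by inspection, whereas BOCU-1 bytes give a reader no such foothold.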

> (My opinion is that, for interchange purposes, BOMs should be allowed
> in ALL encodings if they can represent the U+FEFF codepoint,

No argument there.

> and that this codepoint should also exclusively represent a BOM and no
> ZWNBSP semantic

Too late; U+FEFF nominally still has both semantics, but see below.

> if needed, one could replace every ZWNBSP with ZWJ, making sure that all
> final renderers will be able to render it).

This is a hack. Developers of renderers should make ZWNBSP display
correctly. It's not that hard. Creators of documents shouldn't have to
modify their text to appease the renderer. And remember, it's
default-ignorable.

> All the legacy problems about the BOM would have been much simpler if
> it had been mapped to a non-character (exactly like also U+FFFE)
> instead of a legacy control format (like U+FEFF), but now it is too
> late to change it or recommend some other codepoint.

U+2060 WORD JOINER is recommended for the ZWNBSP semantic. And
honestly, when was the last time you saw U+FEFF in real-world text (not
in a test case) used with the ZWNBSP semantic?
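
For anyone who does run into such text, a rough Python sketch of the migration might look like this. The function name `normalize_feff` is hypothetical, and treating only a single leading U+FEFF as a signature is an assumption of this sketch.

```python
BOM = "\ufeff"          # ZERO WIDTH NO-BREAK SPACE / byte order mark
WORD_JOINER = "\u2060"  # recommended carrier of the ZWNBSP semantic


def normalize_feff(text):
    """Drop one leading U+FEFF (signature) and map the rest to U+2060."""
    if text.startswith(BOM):
        text = text[1:]
    # Any remaining U+FEFF can only have been meant as ZWNBSP.
    return text.replace(BOM, WORD_JOINER)
```

After this pass, U+FEFF never appears in the text itself, so any later occurrence at the start of a stream can be read unambiguously as a signature.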

--
Doug Ewell  |  Thornton, Colorado, USA  |  http://www.ewellic.org
RFC 5645, 4645, UTN #14  |  ietf-languages @ http://is.gd/2kf0s


This archive was generated by hypermail 2.1.5 : Fri Jan 01 2010 - 11:20:15 CST