Re: Numerosity (was: Re: Planck's constant U+210E)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Apr 22 2006 - 03:24:45 CST


    From: "Doug Ewell" <dewell@adelphia.net>
    > Actually, the "hyper surrogate" approach isn't even original:
    > http://hp.vector.co.jp/authors/VA002891/UTF16X.TXT

    I don't know when this was published, but it seems too verbose (6 code units). I spoke about it on this list nearly 2 years ago...

    What I suggested in the previous mail uses only 4 code units to reach all positive signed 32-bit code point values. And it does not necessarily require keeping free space in the BMP (unless we really want a more compact encoding using sequences of 3 16-bit code units).
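    The arithmetic of such a 4-unit scheme can be sketched as follows. This is only an illustrative sketch, not the actual proposal from the earlier mail (whose exact bit layout is not given here); the block location BASE, the 8-bits-per-unit split, and the function names are all assumptions, and the sketch works at the code-point level rather than with raw 16-bit UTF-16 code units.

```python
# Hypothetical sketch: encode any positive signed 32-bit value as a
# sequence of 4 code points drawn from a contiguous block of 256
# unassigned code points. BASE is an invented placement, not anything
# defined by Unicode.
BASE = 0xE1000  # assumption: a free contiguous block in the special plane


def encode_hyper(value: int) -> list[int]:
    """Split a positive signed 32-bit value into 4 pseudo-surrogate code points,
    each carrying 8 payload bits (most significant first)."""
    if not 0 <= value < 2**31:
        raise ValueError("value must be a positive signed 32-bit integer")
    return [BASE + ((value >> shift) & 0xFF) for shift in (24, 16, 8, 0)]


def decode_hyper(units: list[int]) -> int:
    """Reassemble the original value from the 4 code points."""
    value = 0
    for u in units:
        value = (value << 8) | (u - BASE)
    return value
```

    Since 4 units of 8 payload bits give 32 bits, every positive signed 32-bit value (up to 2^31 - 1) fits, which is the property the mail claims for the 4-code-unit sequences.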

    And I suggested several alternatives regarding which ranges are needed. I said that strict backward compatibility was possible, but upward compatibility would be possible as well, given that it would only use sequences of valid but unassigned code points. Upward compatibility is possible provided that legacy algorithms don't break those sequences (my opinion is that a Unicode-compatible algorithm should not attempt to break a sequence of unassigned code points, even if they are valid, and should not insert any other code point in the middle, without having knowledge of what these code points represent; as they are unassigned, in the special plane, applications should not infer anything about those sequences, and should treat them blindly).

    However, I am not sure that Unicode enforces this rule for unassigned code points (it just says: don't use them for encoding new texts until they have a true assignment and semantics, but does not say very precisely how applications or algorithms should behave regarding sequences of unknown but valid code points).

    For this reason, those extra surrogates should be encoded in a contiguous block, to let existing conforming applications use exactly the same character properties for all of them (except their code point value, of course) in all algorithms (including UCA, case mappings, word breakers). This may cause a problem if those sequences are long, because there will be no indication at all about where character breaks are possible. (Those applications that need arbitrary character or word breaks, due to length constraints, are generally renderers, so there is no encoded string output; and even if an application breaks the *rendered* string of unknown code points to show missing glyphs, this should not cause any problem.)
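    The "don't break inside such a sequence" behaviour argued for above can be sketched as a break rule. This is a hedged illustration, not an existing Unicode algorithm: the block location BASE and the rule that runs of code points from that block are atomic are assumptions taken from the mail's argument.

```python
# Hypothetical sketch: a break-opportunity scanner that never splits a
# run of code points from an assumed contiguous unassigned block, so an
# extension sequence is treated atomically, as the mail argues a
# conforming algorithm should.
BASE = 0xE1000  # assumption: same invented block as above
BLOCK = range(BASE, BASE + 0x100)


def break_opportunities(cps: list[int]) -> list[int]:
    """Return the indices i where a break before cps[i] is allowed."""
    breaks = []
    for i in range(1, len(cps)):
        # Never break between two code points of the special block.
        if cps[i - 1] in BLOCK and cps[i] in BLOCK:
            continue
        breaks.append(i)
    return breaks
```

    With this rule, a renderer may still break *around* an extension sequence (to show missing glyphs, say), but never *inside* it, which is all the encoding needs to survive round-tripping through legacy algorithms.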

    There's no need to encode those code points now; but as long as we know that it will be possible to do it later in a rather large contiguous block (for example in the special plane 14), nothing would prevent adding them later.



    This archive was generated by hypermail 2.1.5 : Sat Apr 22 2006 - 03:29:49 CST