Re: UTF-c

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Feb 26 2011 - 13:22:26 CST

    2011/2/26 William_J_G Overington <wjgo_10009@btinternet.com>:
    > Philippe Verdy <verdy_p@wanadoo.fr> wrote:
    >
    >> Note that the scalar values range 0xD800..0xDFFF reserved for surrogates code points MUST be excluded to be a conforming UTF (these code points must not be representable, to allow full bidirectional compatibility with UTF-16 ; this is unlike all other codepoints assigned to non-characters which SHOULD still be representable).
    >
    > How do you arrive at the conclusion about the surrogates please?

    Surrogates have no scalar values. Only the Unicode code points that
    have a scalar value should be encodable in a conforming UTF (but all
    of them, including those assigned to non-characters).
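
    A minimal sketch of that rule, for illustration only. The names
    is_scalar_value and utf8_can_encode are hypothetical and belong to
    no proposal discussed here. The second helper uses Python's
    built-in UTF-8 codec as an example of an existing UTF that behaves
    this way: it encodes non-characters but refuses surrogates.

```python
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF
MAX_CODE_POINT = 0x10FFFF


def is_scalar_value(cp: int) -> bool:
    """True for every code point except the surrogates D800..DFFF."""
    in_code_space = 0 <= cp <= MAX_CODE_POINT
    is_surrogate = SURROGATE_FIRST <= cp <= SURROGATE_LAST
    return in_code_space and not is_surrogate


def utf8_can_encode(cp: int) -> bool:
    """Ask Python's strict UTF-8 codec whether it accepts this code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # U+FFFE is a non-character, U+D800 is a surrogate.
    for cp in (0x0041, 0xFFFE, 0xD800, 0xDFFF, 0xE000, 0x10FFFF):
        print(f"U+{cp:04X}", is_scalar_value(cp), utf8_can_encode(cp))
```

    For every value in the code space the two helpers should agree,
    which is the point: a non-character such as U+FFFE is encodable,
    while a surrogate such as U+D800 is not.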

    > Is it because there are some rules somewhere that require that a surrogate pair copied from a UTF16 sequence must first be combined to produce one codepoint and then that codepoint must be compressed, rather than that the two codepoints be each individually compressed?

    No. It's independent of that (see BOCU-1, which also does not allow
    encoding isolated or incorrectly paired surrogates, but will still
    encode all other code points in any plane as an unbreakable sequence).
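
    To make the distinction concrete, here is a sketch of what a
    converter reading UTF-16 input has to do before any other encoding
    step. It is not BOCU-1 and not UTF-c, only the standard UTF-16
    surrogate-pair arithmetic. The function name
    utf16_units_to_scalars is hypothetical.

```python
def utf16_units_to_scalars(units):
    """Turn a list of 16-bit code units into a list of scalar values.

    Well-formed surrogate pairs are combined into one supplementary
    code point. Isolated or wrongly ordered surrogates are rejected.
    """
    scalars = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                scalars.append(
                    0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                )
                i += 2
                continue
            raise ValueError(f"isolated high surrogate at index {i}")
        if 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            raise ValueError(f"isolated low surrogate at index {i}")
        scalars.append(unit)
        i += 1
    return scalars


if __name__ == "__main__":
    # 'A' followed by the pair D83D DE00, which combines to U+1F600.
    print([hex(s) for s in utf16_units_to_scalars([0x41, 0xD83D, 0xDE00])])
```

    Whatever encoder runs afterwards only ever sees scalar values, so
    it never needs a representation for a lone surrogate.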

    > If so, do those rules necessarily apply to utf-c2? If so, would they apply if the format were denoted by a name that does not include the sequence utf?
    > Would compressing the surrogate codes separately make the design of the format simpler?

    No. It would make it longer and still not simpler.

    > Could sequences starting 10.000000 and 10.000001 be used for switching codes?

    No. You have misread the description. They encode the non-final
    bytes of characters (or special codes) that are encoded as
    multi-byte sequences. The sequence must still be parsed as a whole
    in order to compute the offset scalar value that it represents.
    From this value (which may be altered by the BASE), you can deduce
    whether it is the scalar value of a valid code point (in
    0x0000..0xD7FF or 0xE000..0x10FFFF) or a special code (any other
    value: this includes switch codes and all other unassigned and
    reserved values).
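
    The last step can be sketched as follows. This assumes the
    multi-byte sequence has already been parsed as a whole and that any
    BASE adjustment has already been applied. Neither of those steps is
    shown, since the byte layout is not reproduced in this message. The
    names Kind and classify are hypothetical.

```python
from enum import Enum


class Kind(Enum):
    CODE_POINT = "code point"
    SPECIAL = "special code"


def classify(value: int) -> Kind:
    """Classify the value computed from one complete sequence.

    Values in 0x0000..0xD7FF or 0xE000..0x10FFFF are scalar values of
    valid code points. Anything else is treated as a special code
    (switch codes and other unassigned or reserved values).
    """
    if 0x0000 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF:
        return Kind.CODE_POINT
    return Kind.SPECIAL


if __name__ == "__main__":
    for value in (0x0041, 0xD7FF, 0xD800, 0xE000, 0x10FFFF, 0x110000):
        print(f"0x{value:06X} -> {classify(value).value}")
```

    The decision is made on the computed value alone. No individual
    byte of the sequence, such as a particular non-final byte, can act
    as a switch code by itself.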


