Re: UTF-c

From: William_J_G Overington (
Date: Sat Feb 26 2011 - 04:59:17 CST


    Philippe Verdy <> wrote:
    > 6 bits : 11.yyxxxx
    > Encodes U+00C0..U+00FF (by default):
    > yyxxxx = Unicode scalar value - BASE
    > BASE should necessarily be a multiple of 16 (policy of ISO/IEC 10646-1 for block allocations).
    > BASE must then be able to store up to 15 bits if arbitrary positions in the UCS are possible.
    > BASE is then constrained to 0x80 .. 0x10FFF0 (by steps of 16).
    > Same as ISO-8859-1 only if BASE=0xC0.
    > (BASE may be different from 0xC0 if a switch code has been explicitly used in the stream.)
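    A minimal sketch of the decoding rule quoted above, assuming a single BASE and treating the low six bits of the byte as the offset (the function name and defaults are illustrative, not part of the proposal):

```python
# Sketch of decoding a lone 11.yyxxxx byte under the quoted scheme.
# Assumption: scalar value = BASE + low six bits of the byte.

def decode_6bit(byte: int, base: int = 0xC0) -> int:
    """Decode a byte of the form 11yyxxxx to a Unicode scalar value."""
    assert byte & 0xC0 == 0xC0       # byte must start with bits 11
    assert base % 16 == 0            # BASE is a multiple of 16
    assert 0x80 <= base <= 0x10FFF0  # constrained range from the quote
    return base + (byte & 0x3F)      # scalar = BASE + yyxxxx

# With the default BASE=0xC0, bytes 0xC0..0xFF map straight to
# U+00C0..U+00FF, matching ISO-8859-1 in that range.
```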
    When a byte starting 11 is used in isolation, why is it represented as 11.yyxxxx please?
    Is it because there are four possible values of BASE, namely BASE[0], BASE[1], BASE[2] and BASE[3]?
    If BASE has a non-negative value less than 0x80, could that value of BASE be used to signal access to a decoding tree, so that the most common code points in the text beyond the range U+0000..U+007F could be represented using a single byte starting with 11? The contents of the decoding tree could be dynamically altered using switching codes.
    If the idea of four values for BASE, namely BASE[0], BASE[1], BASE[2] and BASE[3], is used, then access to a decoding tree would be possible simultaneously with one-byte access to a contiguous block of other Unicode characters, if so desired; though if BASE[0]..BASE[3] are used, the range of possible values of BASE would need to be 17 bits.
    For example, at some particular time in some particular application of the format, BASE[0] might have a value of 0x00 and BASE[1] might have a value of 0x100.
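    A sketch of this suggested variant, under the assumption that the yy bits select one of the four bases and the xxxx bits give a 0..15 offset within the selected block (the function and the last two base values are hypothetical):

```python
# Sketch of the suggested BASE[0..3] variant (an assumption, not the
# proposal as specified): yy selects the base, xxxx is the offset.

def decode_multi(byte: int, bases: list[int]) -> int:
    """Decode a byte of the form 11yyxxxx using four bases."""
    assert byte & 0xC0 == 0xC0  # byte must start with bits 11
    yy = (byte >> 4) & 0x3      # 2-bit base selector
    xxxx = byte & 0xF           # 4-bit offset within the block
    return bases[yy] + xxxx

# Using the example values above (BASE[0]=0x00, BASE[1]=0x100;
# the other two are arbitrary illustrations):
bases = [0x00, 0x100, 0x2010, 0x3040]
```

Under this reading, each byte gives one-byte access to sixteen characters per base, so four 16-character blocks at once.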
    William Overington
    26 February 2011

    This archive was generated by hypermail 2.1.5 : Sat Feb 26 2011 - 05:02:58 CST