Re: UTF-c

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Feb 25 2011 - 22:41:41 CST

    An alternative:

    | UTF-c2: ASCII and 2bit prefixes
    |
    |  0....... isolation prefix,
    |  10...... initial or medial prefix,
    |  11...... final or isolation prefix.
    |
    | 7 bits : 0.xxxxxxx
    | Encodes U+0000..U+007F
    | xxxxxxx = Unicode scalar value
    | Same as ASCII, ISO-8859-* and UTF-8
    | (independent of BASE)
    | 6 bits : 11.yyxxxx
    | Encodes U+00C0..U+00FF (by default) :
    | yyxxxx = Unicode scalar value - BASE
    | BASE must be a multiple of 16 (the policy of
    ISO/IEC 10646-1 for block allocations).
    | Storing BASE then takes up to 17 bits (its low 4 bits are
    always 0) if arbitrary positions in the UCS are allowed
    | BASE is then constrained to 0x80 .. 0x10FFF0 (in steps of 16).
    | Same as ISO-8859-1 only if BASE=0xC0
    | (BASE may be different from 0xC0 if a switch code has
    been explicitly used in the stream)
    | 12 bits : 10.yyyyyx, 11.xxxxxx
    | Encodes U+0080..U+407F (minus the 64-character block
    starting at BASE) :
    | yyyyyxxxxxxx = Unicode scalar value - 0x80
    | (independent of BASE)
    | 18 bits : 10.yyyyyy, 10.yyyyyx, 11.xxxxxx
    | Encodes U+4080..U+4407F (minus the 64-character block
    starting at BASE) :
    | yyyyyyyyyyyxxxxxxx = Unicode scalar value - 0x4080
    | * Restriction: scalar values in range 0xD800..0xDFFF
    (reserved for surrogates) are invalid.
    | (independent of BASE)
    | 21 bits : 10.000yyy, 10.yyyyyy, 10.yyyyyx, 11.xxxxxx
    | Encodes U+44080..U+10FFFF (minus the 64-character block
    starting at BASE) :
    | yyyyyyyyyyyyyyxxxxxxx = Unicode scalar value - 0x44080
    | * Restriction: the maximum scalar value is 0x10FFFF;
    higher values are invalid.
    | (independent of BASE)
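
    A minimal sketch of the scheme above in Python, for illustration only.
    The names encode, decode, OFF2, OFF3 and OFF4 are hypothetical, and so
    are the range boundaries: two 6-bit payloads hold 0x1000 values, so the
    sketch lets the 3-byte form start at 0x1080 and the 4-byte form at
    0x41080 rather than at the 0x4080 and 0x44080 quoted above. Treat those
    boundaries as an assumption of the sketch. Switch codes are not
    modelled; BASE is simply a parameter.

```python
MAX_SCALAR = 0x10FFFF
SURROGATES = range(0xD800, 0xE000)

# Lowest scalar value of the 2-, 3- and 4-byte forms. Derived from 6 payload
# bits per byte; an assumption of this sketch, not the offsets quoted above.
OFF2 = 0x80
OFF3 = OFF2 + (1 << 12)  # 0x1080
OFF4 = OFF3 + (1 << 18)  # 0x41080


def _check_scalar(cp):
    """Return cp if it is a Unicode scalar value, else raise ValueError."""
    if not 0 <= cp <= MAX_SCALAR or cp in SURROGATES:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    return cp


def encode(cp, base=0xC0):
    """Encode one scalar value; `base` selects the 64-character window."""
    _check_scalar(cp)
    if cp < 0x80:
        return bytes([cp])                  # 0xxxxxxx
    if base <= cp < base + 64:
        return bytes([0xC0 | (cp - base)])  # 11xxxxxx, isolated
    if cp < OFF3:
        count, value = 2, cp - OFF2
    elif cp < OFF4:
        count, value = 3, cp - OFF3
    else:
        count, value = 4, cp - OFF4
    groups = [(value >> (6 * i)) & 0x3F for i in reversed(range(count))]
    # Every byte but the last gets prefix 10; the last gets prefix 11.
    return bytes([0x80 | g for g in groups[:-1]] + [0xC0 | groups[-1]])


def decode(data, base=0xC0):
    """Decode a byte string into a list of scalar values."""
    out, pending = [], []
    for b in data:
        if b < 0x80:
            if pending:
                raise ValueError("sequence interrupted by an ASCII byte")
            out.append(b)
        elif b < 0xC0:  # prefix 10: initial or medial byte
            pending.append(b & 0x3F)
            if len(pending) > 3:
                raise ValueError("more than 3 non-final bytes")
        elif not pending:  # prefix 11 on its own: BASE window
            out.append(_check_scalar(base + (b & 0x3F)))
        else:  # prefix 11 closing a multi-byte sequence
            value = 0
            for g in pending + [b & 0x3F]:
                value = (value << 6) | g
            offset = (OFF2, OFF3, OFF4)[len(pending) - 1]
            out.append(_check_scalar(value + offset))
            pending = []
    if pending:
        raise ValueError("truncated sequence")
    return out


# Round trip with the default BASE=0xC0.
text = "A\u00e9\u20ac"
data = b"".join(encode(ord(c)) for c in text)
assert decode(data) == [ord(c) for c in text]
```

    With the default BASE=0xC0, encode(0xE9) gives the single byte 0xE9,
    the same byte as in ISO-8859-1. The decoder in this sketch is lenient:
    it also accepts a multi-byte sequence for a character inside the BASE
    window, which a strict decoder would probably reject.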

    With this scheme, every final (or isolated) byte can be recognized on
    its own, and the byte that follows it is necessarily initial or
    isolated. This gives better resynchronization after a loss, which
    then affects only one scalar value.
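
    The resynchronization argument can be sketched in Python. The helper
    split_sequences is hypothetical and the sample bytes are made up for
    illustration; the only rule used is that a byte with prefix 0 or 11
    always ends a sequence.

```python
def split_sequences(data):
    """Cut a byte stream after every byte whose prefix is 0 or 11.

    Returns the list of complete sequences and any unterminated tail.
    """
    seqs, start = [], 0
    for i, b in enumerate(data):
        if b < 0x80 or b >= 0xC0:  # isolated or final byte
            seqs.append(bytes(data[start:i + 1]))
            start = i + 1
    return seqs, bytes(data[start:])


# One ASCII byte, one 3-byte sequence, one isolated 11...... byte.
intact = bytes([0x41, 0x81, 0x80, 0xEC, 0xE9])
# Simulate the loss of the initial byte 0x81 of the 3-byte sequence.
damaged = intact[:1] + intact[2:]

print(split_sequences(intact))
print(split_sequences(damaged))
```

    In the damaged stream only the middle sequence changes (it now looks
    like a 2-byte sequence and would decode to a wrong character); the
    sequences before and after it are still cut at the same places.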

    This is not guaranteed within a bounded number of bytes in your
    original scheme, which uses the same prefix for initial and final
    bytes: in a run of characters encoded as 2-byte sequences, every
    byte starts with 10.

    Note that the scalar value range 0xD800..0xDFFF, reserved for
    surrogate code points, MUST be excluded for this to be a conforming
    UTF (these code points must not be representable, to allow full
    bidirectional compatibility with UTF-16; this is unlike the code
    points assigned to noncharacters, which SHOULD still be
    representable). Under this scheme, this means that the following
    values for the 3-byte encoding are unused:
       yyyyyyyyyyyxxxxxxx = 0x9780..0x9F7F
    These would have been encoded as these 3-byte sequences:
      10.001001, 10.011110, 11.000000 (binary)
      ..
      10.001001, 10.111101, 11.111111 (binary)
    You could still use them as switch codes that change the value of
    BASE to anywhere in U+008° .. U+10FFF°, or at least
    (if we exclude the private supplementary planes) in 0x80..0xEFF0 (i.e.
    one of the 0xE80 possible rows).
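
    The unused range can be recomputed with a few lines of Python. The
    offset 0x4080 and the 10/10/11 prefixes are taken from the description
    above; the function name three_byte_pattern is hypothetical.

```python
OFFSET_3BYTE = 0x4080  # offset of the 3-byte form, as quoted above


def three_byte_pattern(value):
    """Format an 18-bit payload as three prefixed 6-bit groups."""
    groups = [(value >> shift) & 0x3F for shift in (12, 6, 0)]
    prefixes = ["10", "10", "11"]  # initial, medial, final
    return ", ".join(f"{p}.{g:06b}" for p, g in zip(prefixes, groups))


# Payload values that would have encoded the surrogate range.
lo = 0xD800 - OFFSET_3BYTE
hi = 0xDFFF - OFFSET_3BYTE

print(hex(lo), hex(hi), hex(hi - lo + 1))
print(three_byte_pattern(lo))
print(three_byte_pattern(hi))
```

    The third number printed is the count of unused payload values, that
    is, the number of switch codes this range could carry.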

    However, given that there are only 0x800 surrogate values, we are
    restricted to the first 0x800 rows starting at U+008°, which can
    still express any BASE position in U+008°..U+808° (only the first
    half of the BMP). We could be smarter and exclude the 64-character
    rows in the CJK and Hangul blocks and other large syllabaries, where
    positioning BASE would not be useful, as well as, of course, the
    blocks assigned to surrogates. So we don't really need to position
    BASE in U+340°..U+4DB°, or U+4E0°..U+A48°, or U+A50°..U+A63°, or
    U+AC0°..U+D7A°, or U+D8F°..U+FAF°.

    For encoding other BASE positions (in higher planes, for example), we
    can also use the forbidden space of scalar values starting at 0x110000,
    which is encodable in an extended set of 4-byte sequences starting at:
      yyyyyyyyyyyyyyxxxxxxx = 0x110000-0x44080 = 0xCBF80
    i.e. in binary:
      10.000011, 10.001011, 10.111110, 11.000000
       ..
      10.111111, 10.111111, 10.111111, 11.111111

    This last extended set contains exactly 0xF34080 distinct sequences:
    this is far more than enough to encode all possible values for BASE,
    as well as other compression schemes, such as a compact 2-byte
    encoding for large Unicode blocks of up to 12 bits in size (4096
    characters), using a new BASE2 value to replace the default range
    U+0080..U+407F. The characters of that default range would then have
    to be relocated, using a 4-byte encoding in the same extended set of
    sequences, unless we decide that 3-byte encoded characters also start
    at U+0080 instead of U+4080 and 4-byte encoded characters start at
    U+40080 instead of U+44080, and we explicitly say that these 3-byte
    and 4-byte encoded characters exclude those selected by BASE2.
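
    The counts above can be checked with a short Python calculation. It
    takes the offset 0x44080 and the 24 payload bits of a 4-byte sequence
    from the text above; the variable names are hypothetical.

```python
OFFSET_4BYTE = 0x44080  # offset of the 4-byte form, as quoted above
PAYLOAD_BITS = 4 * 6    # four bytes with 6 payload bits each

# First payload value that no longer maps to a valid scalar value.
first_extended = 0x110000 - OFFSET_4BYTE
# Number of 4-byte payload values from there to the all-ones payload.
extended_count = (1 << PAYLOAD_BITS) - first_extended
# Number of BASE positions in 0x80..0x10FFF0, in steps of 16.
base_positions = (0x10FFF0 - 0x80) // 16 + 1

print(hex(first_extended))               # 0xcbf80
print(hex(extended_count))               # 0xf34080
print(extended_count >= base_positions)  # True
```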

    Philippe.

    2011/2/20 Christoph Päper <christoph.paeper@crissov.de>:
    > Thomas Cropley:
    >
    >> <UTF-c.htm>
    >
    > It’s a fair idea to be backwards compatible with (most of) ISO 8859-1 by encoding U+00C0–00FF as C0h (11000000b) through FFh (11111111b) – I will not consider codepage switching with quasi-BOMs at all, because it seems like a bad idea, U+00A0–00BF are missing anyhow – and reusing the bytes 80h (10000000b) through BFh (10111111b), not 9Fh, for encoding higher codepoints. I don’t think it’s a good idea to also use 11......b in multibyte code sequences, though.
    >
    > UTF-8: ASCII and 3–5bit/2bit prefixes
    >
    >  0....... isolation prefix,
    >  110..... initial prefix,
    >  1110.... initial prefix,
    >  11110... initial prefix,
    >  11111... illegal prefix;
    >  10...... medial and final prefix.
    >
    >  7  0xxxxxxx
    >  11  110yyyxx 10xxxxxx
    >  16  1110yyyy 10yyyyxx 10xxxxxx
    >  21  11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
    >
    > UTF-c: ASCII and 2bit prefixes
    >
    >  0....... isolation prefix,
    >  10...... initial and final prefix,
    >  11...... medial and isolation prefix.
    >
    >  7  0xxxxxxx
    >  6  11xxxxxx
    >  12  10yyyyxx 10xxxxxx
    >  18  10zzyyyy 11yyyyxx 10xxxxxx
    >  21  10°°°zzz 11zzyyyy 11yyyyxx 10xxxxxx


