Re: UTF-c

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Feb 25 2011 - 22:41:41 CST

Next message: William_J_G Overington: "Re: UTF-c"

Previous message: Mark Rosa: "Re: Kaida font (work in progress)"
In reply to: Christoph Päper: "Re: UTF-c"
Next in thread: Doug Ewell: "RE: UTF-c"

An alternative:

| UTF-c2: ASCII and 2bit prefixes
|
| 0....... isolation prefix,
| 10...... initial or medial prefix,
| 11...... final or isolation prefix.
|
| 7 bits : 0.xxxxxxx
|   Encodes U+0000..U+007F:
|   xxxxxxx = Unicode scalar value
|   Same as ASCII, ISO-8859-* and UTF-8
|   (independent of BASE)
| 6 bits : 11.yyxxxx
|   Encodes U+00C0..U+00FF (by default):
|   yyxxxx = Unicode scalar value - BASE
|   BASE should necessarily be a multiple of 16 (policy of
|   ISO/IEC 10646-1 for block allocations).
|   BASE/16 then needs up to 17 bits of storage if arbitrary
|   positions in the UCS are possible (0x10FFF0 / 16 = 0x10FFF).
|   BASE is then constrained to 0x80..0x10FFF0 (by steps of 16).
|   Same as ISO-8859-1 only if BASE=0xC0
|   (BASE may be different from 0xC0 if a switch code has been
|   explicitly used in the stream)
| 12 bits : 10.yyyyyx, 11.xxxxxx
|   Encodes U+0080..U+407F (minus the 64-character block
|   starting at BASE):
|   yyyyyxxxxxxx = Unicode scalar value - 0x80
|   (independent of BASE)
| 18 bits : 10.yyyyyy, 10.yyyyyx, 11.xxxxxx
|   Encodes U+4080..U+4407F (minus the 64-character block
|   starting at BASE):
|   yyyyyyyyyyyxxxxxxx = Unicode scalar value - 0x4080
|   * Restriction: scalar values in range 0xD800..0xDFFF
|     (reserved for surrogates) are invalid.
|   (independent of BASE)
| 21 bits : 10.000yyy, 10.yyyyyy, 10.yyyyyx, 11.xxxxxx
|   Encodes U+44080..U+10FFFF (minus the 64-character block
|   starting at BASE):
|   yyyyyyyyyyyyyyxxxxxxx = Unicode scalar value - 0x44080
|   * Restriction: the restricted maximum scalar value is
|     0x10FFFF, higher values are invalid.
|   (independent of BASE)
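
To make the two single-byte forms concrete, here is a minimal Python sketch of how they could be encoded and decoded. The function names are mine and purely illustrative; the sketch only covers the 7-bit and 6-bit forms and assumes the decoder already knows the byte stands alone (a 11...... byte can also be the final byte of a longer sequence).

```python
DEFAULT_BASE = 0xC0  # default BASE from the table above


def encode_one_byte(cp, base=DEFAULT_BASE):
    """Return the single-byte form of cp, or None if a longer sequence is needed."""
    if base % 16 != 0 or not 0x80 <= base <= 0x10FFF0:
        raise ValueError("BASE must be a multiple of 16 in 0x80..0x10FFF0")
    if cp < 0x80:
        return cp                    # 0.xxxxxxx, independent of BASE
    offset = cp - base
    if 0 <= offset < 0x40:
        return 0xC0 | offset         # 11.yyxxxx, relative to BASE
    return None                      # needs a 2-, 3- or 4-byte sequence


def decode_one_byte(byte, base=DEFAULT_BASE):
    """Decode an isolated byte; returns None for a 10...... byte."""
    if byte < 0x80:
        return byte
    if byte >= 0xC0:
        return base + (byte & 0x3F)
    return None                      # 10...... cannot stand alone


# With the default BASE, U+00E9 keeps its ISO-8859-1 byte value.
print(hex(encode_one_byte(0xE9)))                 # 0xe9
# With BASE switched to 0x400 (an assumed example), U+0410 fits in one byte.
print(hex(encode_one_byte(0x410, base=0x400)))    # 0xd0
```

How BASE actually gets switched in the stream is left open here; the sketch just takes it as a parameter.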

With this scheme, you ensure that at least all final (or isolated)
positions will be detected (and then the next position is initial or
isolated). This offers better resynchronization if there's some loss
(that will affect only one scalar value).
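
As a rough illustration of that resynchronization property, the following Python sketch splits a byte stream into character sequences using only the 2-bit prefixes. The helper names are hypothetical, and the sketch does not decode scalar values or validate ranges; it only finds boundaries.

```python
def ends_character(byte):
    """True for 0....... (isolated ASCII) and 11...... (final or isolated)."""
    return (byte & 0xC0) != 0x80


def split_sequences(data):
    """Split data into per-character chunks, plus any unfinished tail."""
    chunks = []
    current = bytearray()
    for b in data:
        if b < 0x80 and current:
            # ASCII is always isolated, so pending 10...... bytes are damaged.
            chunks.append(bytes(current))
            current.clear()
        current.append(b)
        if ends_character(b):
            chunks.append(bytes(current))
            current.clear()
    return chunks, bytes(current)


stream = bytes([0x41, 0x81, 0xC5, 0xE9, 0x82, 0x83, 0xFF])
print(split_sequences(stream)[0])
# [b'A', b'\x81\xc5', b'\xe9', b'\x82\x83\xff']

# Drop the initial byte 0x81: only that one character is affected,
# and every later boundary is found exactly as before.
damaged = stream[:1] + stream[2:]
print(split_sequences(damaged)[0])
# [b'A', b'\xc5', b'\xe9', b'\x82\x83\xff']
```

Note that the damaged character still looks like a valid (isolated) byte, so a loss is contained but not necessarily detected.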

This is not guaranteed in your original scheme within a bounded number
of bytes, because it uses the same prefix for initial and final bytes:
in a run of characters encoded as 2-byte sequences, every byte carries
that prefix.

Note that the scalar value range 0xD800..0xDFFF reserved for surrogate
code points MUST be excluded for this to be a conforming UTF (these
code points must not be representable, to allow full bidirectional
compatibility with UTF-16; this is unlike all other code points
assigned to noncharacters, which SHOULD still be representable). Under
this scheme, this means that the following values for the 3-byte
encoding are unused:
   yyyyyyyyyyyxxxxxxx = 0x9780..0x9F7F
These would have been encoded as these 3-byte sequences:
  10.001001, 10.011110, 11.000000 (binary)
  ..
  10.001001, 10.111101, 11.111111 (binary)
You could still use them to represent the switch codes that change the
value of BASE anywhere in U+008°..U+10FFF°, or at least (if we exclude
the private supplementary planes) in 0x80..0xEFF0 (i.e. one of the
0xE80 possible rows).
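
The arithmetic for this gap can be checked with a short Python sketch. It takes the 3-byte offset 0x4080 from the table as given; `pack18` and `show` are hypothetical helper names used only for this check.

```python
OFFSET_3 = 0x4080  # offset of the 3-byte form, as stated in the table


def pack18(value):
    """Split an 18-bit value into 10.xxxxxx, 10.xxxxxx, 11.xxxxxx bytes."""
    if not 0 <= value < (1 << 18):
        raise ValueError("value does not fit in 18 bits")
    return bytes([
        0x80 | (value >> 12),           # initial byte, prefix 10
        0x80 | ((value >> 6) & 0x3F),   # medial byte, prefix 10
        0xC0 | (value & 0x3F),          # final byte, prefix 11
    ])


def show(seq):
    """Format bytes in the prefix.payload notation used in this thread."""
    return ", ".join(f"{b >> 6:02b}.{b & 0x3F:06b}" for b in seq)


first_unused = 0xD800 - OFFSET_3
last_unused = 0xDFFF - OFFSET_3

print(hex(first_unused), hex(last_unused))   # 0x9780 0x9f7f
print(show(pack18(first_unused)))            # 10.001001, 10.011110, 11.000000
print(show(pack18(last_unused)))             # 10.001001, 10.111101, 11.111111
print(hex(last_unused - first_unused + 1))   # 0x800
```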

However, given that there are only 0x800 surrogate values, we have to
restrict ourselves to the first 0x800 rows starting at U+008°, which
can still express any BASE position in U+008°..U+807° (only the first
half of the BMP). We could be smarter and exclude the 64-character
rows in the CJK and Hangul blocks and other large syllabaries
(positioning BASE there would not be useful), as well as, of course,
the blocks assigned to surrogates. So we don't really need to position
BASE in U+340°..U+4DB°, U+4E0°..U+A48°, U+A50°..U+A63°,
U+AC0°..U+D7A°, or U+D8F°..U+FAF°.

To encode other BASE positions (in higher planes, for example), we can
also use the forbidden space of scalar values starting at 0x110000,
encodable in an extended set of 4-byte sequences starting at:
  yyyyyyyyyyyyyyxxxxxxx = 0x110000-0x44080 = 0xCBF80
i.e. in binary:
  10.000011, 10.001011, 10.111110, 11.000000
   ..
  10.111111, 10.111111, 10.111111, 11.111111

This last extended set contains exactly 0xF34080 distinct sequences,
far more than enough to encode all possible values for BASE, as well
as other compression schemes. One example is a compact 2-byte encoding
for large Unicode blocks of up to 12 bits in size (4096 characters),
using a new BASE2 value to replace the default range U+0080..U+407F.
The characters of that default range would then have to be relocated
themselves, using a 4-byte encoding in the same extended set of
sequences. Alternatively, we could decide that 3-byte encoded
characters also start at U+0080 instead of U+4080, and that 4-byte
encoded characters start at U+40080 instead of U+44080, and explicitly
say that these 3-byte and 4-byte encoded characters exclude those that
are selected by BASE2.
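
The size of that extended set can be checked the same way. This Python sketch assumes, as the count 0xF34080 implies, that the first byte of an extended 4-byte sequence may use all 6 payload bits (24 bits in total), and that the last byte keeps the 11 final prefix; `pack24` is a hypothetical helper name.

```python
OFFSET_4 = 0x44080        # offset of the 4-byte form, as stated in the table
MAX_SCALAR = 0x10FFFF


def pack24(value):
    """Split a 24-bit value into three 10.xxxxxx bytes and one 11.xxxxxx byte."""
    if not 0 <= value < (1 << 24):
        raise ValueError("value does not fit in 24 bits")
    return bytes([
        0x80 | (value >> 18),
        0x80 | ((value >> 12) & 0x3F),
        0x80 | ((value >> 6) & 0x3F),
        0xC0 | (value & 0x3F),          # final byte, prefix 11
    ])


first_extended = (MAX_SCALAR + 1) - OFFSET_4
extended_count = (1 << 24) - first_extended

print(hex(first_extended))                # 0xcbf80
print(hex(extended_count))                # 0xf34080
print(pack24(first_extended).hex(" "))    # 83 8b be c0
print(pack24((1 << 24) - 1).hex(" "))     # bf bf bf ff
```

For comparison, the BASE values themselves (0x80..0x10FFF0 by steps of 16) number fewer than 0x11000, so they occupy only a tiny part of this space.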

Philippe.

2011/2/20 Christoph Päper <christoph.paeper@crissov.de>:
> Thomas Cropley:
>
>> <UTF-c.htm>
>
> It’s a fair idea to be backwards compatible with (most of) ISO 8859-1 by encoding U+00C0–00FF as C0h (11000000b) through FFh (11111111b) – I will not consider codepage switching with quasi-BOMs at all, because it seems like a bad idea, U+00A0–00BF are missing anyhow – and reusing the bytes 80h (10000000b) through BFh (10111111b), not 9Fh, for encoding higher codepoints. I don’t think it’s a good idea to also use 11......b in multibyte code sequences, though.
>
> UTF-8: ASCII and 3–5bit/2bit prefixes
>
> 0....... isolation prefix,
> 110..... initial prefix,
> 1110.... initial prefix,
> 11110... initial prefix,
> 11111... illegal prefix;
> 10...... medial and final prefix.
>
> 7 0xxxxxxx
> 11 110yyyxx 10xxxxxx
> 16 1110yyyy 10yyyyxx 10xxxxxx
> 21 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
>
> UTF-c: ASCII and 2bit prefixes
>
> 0....... isolation prefix,
> 10...... initial and final prefix,
> 11...... medial and isolation prefix.
>
> 7 0xxxxxxx
> 6 11xxxxxx
> 12 10yyyyxx 10xxxxxx
> 18 10zzyyyy 11yyyyxx 10xxxxxx
> 21 10°°°zzz 11zzyyyy 11yyyyxx 10xxxxxx


This archive was generated by hypermail 2.1.5 : Fri Feb 25 2011 - 22:48:17 CST