Re: UTF-c

From: Christoph Päper (christoph.paeper@crissov.de)
Date: Sun Feb 20 2011 - 13:48:56 CST

    Thomas Cropley:

    > <UTF-c.htm>

    Staying backwards compatible with (most of) ISO 8859-1 by encoding U+00C0–00FF as the single bytes C0h (11000000b) through FFh (11111111b) is a fair idea, and so is reusing the bytes 80h (10000000b) through BFh (10111111b), rather than only those up to 9Fh, for encoding higher code points. (I will not consider codepage switching with quasi-BOMs at all, because it seems like a bad idea; U+00A0–00BF are missing from the single-byte range anyhow.) I don’t think it’s a good idea to also use 11......b bytes inside multibyte sequences, though; the alternatives below therefore reserve 11......b for isolated bytes.

    UTF-8: ASCII and 3–5bit/2bit prefixes

     0....... isolation prefix,
     110..... initial prefix,
     1110.... initial prefix,
     11110... initial prefix,
     11111... illegal prefix;
     10...... medial and final prefix.

      7 0xxxxxxx
     11 110yyyxx 10xxxxxx
     16 1110yyyy 10yyyyxx 10xxxxxx
     21 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
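
For comparison, the UTF-8 layout above can be written out as a short Python sketch. The function name utf8_encode is only illustrative; the bit packing follows the four rows of the table, and the result can be checked against Python's built-in codec.

```python
def utf8_encode(cp: int) -> bytes:
    """Pack one code point into UTF-8 bytes, following the table above."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:        # 7 bits:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 11 bits: 110yyyxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 16 bits: 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 21 bits: 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

Every byte announces its own role here: 0....... stands alone, 11...... starts a sequence, 10...... continues one.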

    UTF-c: ASCII and 2bit prefixes

     0....... isolation prefix,
     10...... initial and final prefix,
     11...... medial and isolation prefix.

      7 0xxxxxxx
      6 11xxxxxx
     12 10yyyyxx 10xxxxxx
     18 10zzyyyy 11yyyyxx 10xxxxxx
     21 10°°°zzz 11zzyyyy 11yyyyxx 10xxxxxx
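
A rough Python sketch of the UTF-c layout as I have summarised it above. This is my reading of the table only, not Thomas's specification: I assume there are no offsets, that the shortest form that fits is used, and that U+0080–00BF go into the two-byte form. The names utfc_encode and utfc_decode are made up for the example.

```python
def utfc_encode(cp: int) -> bytes:
    """Encode one code point per the UTF-c table above (assumed: no offsets)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80 or 0xC0 <= cp <= 0xFF:   # 0xxxxxxx or 11xxxxxx, one byte
        return bytes([cp])
    if cp < 1 << 12:                      # 10yyyyxx 10xxxxxx
        return bytes([0x80 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 1 << 18:                      # 10zzyyyy 11yyyyxx 10xxxxxx
        return bytes([0x80 | (cp >> 12),
                      0xC0 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 10°°°zzz 11zzyyyy 11yyyyxx 10xxxxxx
    return bytes([0x80 | (cp >> 18),
                  0xC0 | ((cp >> 12) & 0x3F),
                  0xC0 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utfc_decode(data: bytes) -> list:
    """Decode a byte string into a list of code points (stateful scan)."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        i += 1
        if b < 0x80 or b >= 0xC0:         # isolated byte outside a sequence
            out.append(b)
            continue
        cp, length = b & 0x3F, 1          # 10......: initial byte
        while True:
            if i >= len(data) or data[i] < 0x80 or length == 4:
                raise ValueError("malformed sequence")
            b = data[i]
            i += 1
            length += 1
            cp = (cp << 6) | (b & 0x3F)
            if b < 0xC0:                  # 10......: final byte
                break
        out.append(cp)
    return out
```

Note that a 10...... byte can open or close a sequence and a 11...... byte can be medial or isolated, so this decoder has to keep state from the start of the stream.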

    Type 1: ASCII and 4bit prefix

     0....... isolation prefix,
     11...... isolation prefix,
     10##.... initial prefixes with following-bytes count,
     1000.... medial and final prefix.

      7 0xxxxxxx
      6 11xxxxxx
      8 1001xxxx 1000xxxx
     12 1010yyyy 1000xxxx 1000xxxx
     16 1011yyyy 1000yyyy 1000xxxx 1000xxxx
     
    => incomplete coverage.

    Type 2: ASCII and 5bit/3bit prefix

     0....... isolation prefix,
     11...... isolation prefix,
     101##... initial prefixes with following-bytes count (+1),
     100..... medial and final prefix.

      7 0xxxxxxx
      6 11xxxxxx
      8 10100xxx 100xxxxx
     13 10101yyy 100yyxxx 100xxxxx
     18 10110zzy 100yyyyy 100yyxxx 100xxxxx
     21 10111°°z 100zzzzy 100yyyyy 100yyxxx 100xxxxx

    Type 3.1: ASCII and 3bit prefix

     0....... isolation prefix,
     11...... isolation prefix,
     101..... initial and medial byte prefix,
     100..... final byte prefix.

      7 0xxxxxxx
      6 11xxxxxx
     10 101yyxxx 100xxxxx
     15 101yyyyy 101yyxxx 100xxxxx
     20 101zzzzy 101yyyyy 101yyxxx 100xxxxx
     21 101°°°°z 101zzzzy 101yyyyy 101yyxxx 100xxxxx

    Type 3.2: ASCII and 3bit prefix

     0....... isolation prefix,
     11...... isolation prefix,
     101..... initial and final prefix,
     100..... medial prefix.

      7 0xxxxxxx
      6 11xxxxxx
     10 101yyxxx 101xxxxx
     15 101yyyyy 100yyxxx 101xxxxx
     20 101zzzzy 100yyyyy 100yyxxx 101xxxxx
     21 101°°°°z 100zzzzy 100yyyyy 100yyxxx 101xxxxx

    Type 3.3: ASCII and 3bit prefix

     0....... isolation prefix,
     11...... isolation prefix,
     101..... initial prefix,
     100..... medial and final prefix.

      7 0xxxxxxx
      6 11xxxxxx
     10 101yyxxx 100xxxxx
     15 101yyyyy 100yyxxx 100xxxxx
     20 101zzzzy 100yyyyy 100yyxxx 100xxxxx
     21 101°°°°z 100zzzzy 100yyyyy 100yyxxx 100xxxxx
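
The three Type 3 variants differ only in which of the two prefixes goes on the medial and final bytes, so one Python sketch can cover all of them. The table PREFIXES and the function type3_encode are names invented for this example, and I assume the shortest form that fits is always chosen.

```python
# (initial, medial, final) prefixes per variant: 0xA0 is 101....., 0x80 is 100.....
PREFIXES = {
    "3.1": (0xA0, 0xA0, 0x80),
    "3.2": (0xA0, 0x80, 0xA0),
    "3.3": (0xA0, 0x80, 0x80),
}


def type3_encode(cp: int, variant: str = "3.3") -> bytes:
    """Encode one code point with 5 payload bits per multibyte-sequence byte."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80 or 0xC0 <= cp <= 0xFF:   # 0xxxxxxx or 11xxxxxx, one byte
        return bytes([cp])
    initial, medial, final = PREFIXES[variant]
    n = 2                                 # shortest multibyte form has 2 bytes
    while cp >> (5 * n):                  # grow until 5*n bits hold the value
        n += 1
    # Split into n groups of 5 bits, most significant group first.
    groups = [(cp >> (5 * k)) & 0x1F for k in range(n - 1, -1, -1)]
    prefixes = [initial] + [medial] * (n - 2) + [final]
    return bytes(p | g for p, g in zip(prefixes, groups))
```

For U+20AC this gives A8 85 8C in Type 3.3, A8 A5 8C in Type 3.1 and A8 85 AC in Type 3.2.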

    Type 4: Latin1 and 4bit prefix

     0....... isolation prefix,
     101..... isolation prefix,
     11...... isolation prefix,
     1001.... initial prefix,
     1000.... medial and final prefix.

      7 0xxxxxxx
      6 11xxxxxx
      5 101xxxxx
      8 1001xxxx 1000xxxx
     12 1001yyyy 1000xxxx 1000xxxx
     16 1001yyyy 1000yyyy 1000xxxx 1000xxxx
     20 1001zzzz 1000yyyy 1000yyyy 1000xxxx 1000xxxx
     21 1001°°°z 1000zzzz 1000yyyy 1000yyyy 1000xxxx 1000xxxx
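
Type 4 as a Python sketch, again only my illustration of the table above: every byte from A0h to FFh stays a single Latin-1 character, and everything else outside ASCII is split into 4-bit groups. The names type4_encode and type4_decode are hypothetical, and the decoder does not reject overlong or truncated sequences.

```python
def type4_encode(cp: int) -> bytes:
    """Encode one code point: ASCII and U+00A0-00FF as one byte, rest as nibbles."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80 or 0xA0 <= cp <= 0xFF:   # 0......., 101..... or 11......
        return bytes([cp])
    n = 2                                 # shortest multibyte form has 2 bytes
    while cp >> (4 * n):                  # grow until 4*n bits hold the value
        n += 1
    nibbles = [(cp >> (4 * k)) & 0x0F for k in range(n - 1, -1, -1)]
    # 1001.... on the initial byte, 1000.... on all following bytes
    return bytes([0x90 | nibbles[0]] + [0x80 | v for v in nibbles[1:]])


def type4_decode(data: bytes) -> list:
    """Decode a byte string into a list of code points."""
    out, cp = [], None
    for b in data:
        if b & 0xF0 == 0x80:              # 1000....: medial or final byte
            if cp is None:
                raise ValueError("stray continuation byte")
            cp = (cp << 4) | (b & 0x0F)
            continue
        if cp is not None:                # any other byte ends the sequence
            out.append(cp)
            cp = None
        if b & 0xF0 == 0x90:              # 1001....: initial byte
            cp = b & 0x0F
        else:                             # isolated ASCII or Latin-1 byte
            out.append(b)
    if cp is not None:
        out.append(cp)
    return out
```

Each byte's top four bits identify its role on their own, which is what the single-byte Latin-1 range costs: only 4 payload bits per byte and up to 6 bytes per code point.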

    Type 5: Latin1 and 6bit/4bit prefix

     0....... isolation prefix,
     101..... isolation prefix,
     11...... isolation prefix,
     1001##.. initial prefix with following-bytes count (+1),
     1000.... medial and final prefix.

      7 0xxxxxxx
      6 11xxxxxx
      5 101xxxxx
      6 100100xx 1000xxxx
     10 100101yy 1000xxxx 1000xxxx
     14 100110yy 1000yyyy 1000xxxx 1000xxxx
     18 100111zz 1000yyyy 1000yyyy 1000xxxx 1000xxxx

    => incomplete coverage.
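
The two "incomplete coverage" verdicts can be checked with a little arithmetic, sketched here in Python. For each scheme I list the number of bytes in the longest form shown above, the payload bits in its initial byte and the payload bits in each following byte; the names SCHEMES, capacity_bits, covers_unicode and incomplete are only for this example.

```python
# name: (bytes in longest form, payload bits in initial byte,
#        payload bits in each following byte)
SCHEMES = {
    "UTF-8":    (4, 3, 6),
    "UTF-c":    (4, 6, 6),
    "Type 1":   (4, 4, 4),
    "Type 2":   (5, 3, 5),
    "Type 3.x": (5, 5, 5),
    "Type 4":   (6, 4, 4),
    "Type 5":   (5, 2, 4),
}


def capacity_bits(max_bytes: int, lead_bits: int, cont_bits: int) -> int:
    """Payload bit positions available in the longest form."""
    return lead_bits + (max_bytes - 1) * cont_bits


def covers_unicode(bits: int) -> bool:
    """True if that many bits can represent every code point up to U+10FFFF."""
    return (1 << bits) > 0x10FFFF


def incomplete() -> list:
    """Names of the schemes whose longest form falls short of 21 bits."""
    return [name for name, shape in SCHEMES.items()
            if not covers_unicode(capacity_bits(*shape))]
```

Type 1 tops out at 16 bits and Type 5 at 18, so neither reaches the 21 bits needed for U+10FFFF; the other schemes have 21 or more bit positions in their longest form.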


