Re: UTF-8 ill-formed question from Philippe Verdy on 2012-12-16 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sun, 16 Dec 2012 17:48:15 +0100

OK, then here is the minor change to the UTF-8 MPE (Magic Pocket Encoder),
including the extra row for strict conformance. It also strips the
non-standard leading zeroes from the U+nnnnn notation for code points.

(Yes, this is a derived work. I still credit Marco Cimarosti, but do not want
to assume any additional copyright, and the card states that it is a modified
version and not the original. I assume that he published it under a compatible
licence that allowed you to republish version 1.1 on this list.) The extra row
is very similar in its conversion mechanism, except that it explicitly
states the valid code points that can be safely converted. Some
abbreviations in the description and in one column header are now fully
expanded for clarity; apart from this, the text is identical:

Side 1 (print and cut out):

  ╔════════════╦═══════╤══════════════════════════════╗
  ║ U+0000     ║ yy zz │            Cima's            ║
  ║ U+007F     ║ ▼  ▼  │  UTF-8 Magic Pocket Encoder  ║
  ║ YZ         ║ .  .  │  Vers. 1.1.1, 16 Dec. 2012   ║
  ╠────────────╫───────┼───────┐               ╔══════╣
  ║ U+0080     ║ 3x xy │ 2y zz │ Derived from  ║ Hex= ║
  ║ U+07FF     ║ 3. .. │ 2. ▼  │ Vers. 1.1     ║ Base ║
  ║ XYZ        ║ .  .  │ .  .  │ 30 June 2004  ║  -4  ║
  ╠────────────╫───────┼───────┼───────┐       ║ 0=00 ║
  ║ U+0800     ║ 32 ww │ 2x xy │ 2y zz │ M.C.  ║ 1=01 ║
  ║ U+D7FF     ║ ▼  ▼  │ 2. .. │ 2. ▼  │       ║ 2=02 ║
  ║ WXYZ       ║ E  .  │ .  .  │ .  .  │       ║ 3=03 ║
  ╠────────────╫───────┼───────┼───────┤       ║ 4=10 ║
  ║ U+E000     ║ 32 3w │ 2x xy │ 2y zz │       ║ 5=11 ║
  ║ U+FFFF     ║ ▼  ▼  │ 2. .. │ 2. ▼  │       ║ 6=12 ║
  ║ WXYZ       ║ E  .  │ .  .  │ .  .  │       ║ 7=13 ║
  ╠────────────╫───────┼───────┼───────┼───────╢ 8=20 ║
  ║ U-10000    ║ 33 0v │ 2v ww │ 2x xy │ 2y zz ║ 9=21 ║
  ║ U-FFFFF    ║ ▼  0. │ 2. ▼  │ 2. .. │ 2. ▼  ║ A=22 ║
  ║ VWXYZ      ║ F  .  │ .  .  │ .  .  │ .  .  ║ B=23 ║
  ╠────────────╫───────┼───────┼───────┼───────╢ C=30 ║
  ║ U-100000   ║ 33 10 │ 20 ww │ 2x xy │ 2y zz ║ D=31 ║
  ║ U-10FFFF   ║ ▼  1. │ 2. ▼  │ 2. .. │ 2. ▼  ║ E=32 ║
  ║ WXYZ       ║ F  4  │ 8  .  │ .  .  │ .  .  ║ F=33 ║
  ╚════════════╩═══════╧═══════╧═══════╧═══════╩══════╝
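
For readers who prefer code to cardboard, the left column of side 1 can be
read as a plain range lookup. The sketch below is only an illustration: the
names ROWS, pick_row and octet_count are made up for this example and are not
part of the card. It shows how splitting the three-octet range into two rows
leaves the surrogates U+D800..U+DFFF without any row, so they are rejected.

```python
# Rows of side 1 as (first, last, number of octets). Illustrative names only.
# The gap between U+D7FF and U+E000 is where the surrogates would be.
ROWS = [
    (0x0000, 0x007F, 1),
    (0x0080, 0x07FF, 2),
    (0x0800, 0xD7FF, 3),
    (0xE000, 0xFFFF, 3),
    (0x10000, 0xFFFFF, 4),
    (0x100000, 0x10FFFF, 4),
]


def pick_row(code_point):
    """Return the index of the card row that covers code_point."""
    for index, (first, last, _octets) in enumerate(ROWS):
        if first <= code_point <= last:
            return index
    # No row applies: a surrogate, a negative value, or above U+10FFFF.
    raise ValueError("U+%04X is not a Unicode scalar value" % code_point)


def octet_count(code_point):
    """Return how many UTF-8 octets the selected row produces."""
    return ROWS[pick_row(code_point)][2]
```

With these definitions, pick_row(0x20AC) selects the third row (index 2),
while pick_row(0xD800) raises ValueError because no row covers it.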

Side 2 (print, cut out, and glue on back of side 1):

  ╔═══════════════════════════════════════════════════╗
  ║ Cima's UTF-8 Magic Pocket Encoder - User's Manual ║
  ║ (version 1.1.1, 16 Dec. 2012 - modified from the  ║
  ║  original version 1.1, 2004, by Marco Cimarosti)  ║
  ║                                                   ║
  ║ - Left column: min and max Unicode scalar values: ║
  ║   pick the row that applies to the code point you ║
  ║   want to convert to UTF-8. Letters V..Z mark the ║
  ║   hexadecimal digits that have to be processed.   ║
  ║ - Right column: hexadecimal to base-4 table.      ║
  ║ - Central columns: work area to compute each of   ║
  ║   the 1 to 4 octets that constitute valid UTF-8   ║
  ║   octet sequences.                                ║
  ║                                                   ║
  ║ Convert each digit marked by V..Z from hexadecimal║
  ║ to base-4. Write base-4 digits on the dots placed ║
  ║ under letters v..z (two base-4 digits per hex.    ║
  ║ digit). Convert 2-digit base-4 number to hex.     ║
  ║ digits and write them on the dots on the line.    ║
  ║ That is your UTF-8 sequence in hexadecimal.       ║
  ║ ▼ Triangular arrow heads show passages that may   ║
  ║   be skipped, either because the digit is         ║
  ║   hard-coded, or because it may be copied directly║
  ║   from the scalar value.                          ║
  ╚═══════════════════════════════════════════════════╝
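
The procedure in the manual can also be followed mechanically. The sketch
below is a rough transcription under my own assumptions, not a reference
implementation: the function names are invented for this example, and the
digit groupings are my reading of the patterns on side 1 (for instance
"32 ww | 2x xy | 2y zz" for three octets). It writes the code point in
base 4, regroups the digits four at a time behind the hard-coded lead
digits, and turns each group back into two hexadecimal digits.

```python
def base4_group_to_octet(group):
    """Turn four base-4 digits into one octet (two hexadecimal digits)."""
    a, b, c, d = group
    high_hex_digit = a * 4 + b   # first pair of base-4 digits
    low_hex_digit = c * 4 + d    # second pair of base-4 digits
    return high_hex_digit * 16 + low_hex_digit


def encode_base4(code_point):
    """Encode one Unicode scalar value to UTF-8 by way of base-4 digits."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    # Eleven base-4 digits, most significant first (enough for 21 bits).
    d = [(code_point >> shift) & 3 for shift in range(20, -2, -2)]
    if code_point <= 0x7F:        # pattern: yy zz
        groups = [d[7:11]]
    elif code_point <= 0x7FF:     # pattern: 3x xy | 2y zz
        groups = [[3] + d[5:8], [2] + d[8:11]]
    elif code_point <= 0xFFFF:    # pattern: 32 ww | 2x xy | 2y zz
        groups = [[3, 2] + d[3:5], [2] + d[5:8], [2] + d[8:11]]
    else:                         # pattern: 33 0v | 2v ww | 2x xy | 2y zz
        groups = [[3, 3] + d[0:2], [2] + d[2:5], [2] + d[5:8], [2] + d[8:11]]
    return bytes(base4_group_to_octet(g) for g in groups)
```

For example, U+20AC has the base-4 digits 0200 2230, so the groups are
32 02, 20 02 and 22 30, and encode_base4(0x20AC).hex() gives 'e282ac'.
The result can be compared with Python's own chr(0x20AC).encode("utf-8").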

2012/12/16 Otto Stolz <Otto.Stolz_at_uni-konstanz.de>

> Hello,
>
>
> 2012/12/16 Otto Stolz <Otto.Stolz_at_uni-konstanz.de>
>
>> The reason I excluded the surrogates from my UTF-8 MPE
>> was really that I needed additional space for the user’s
>> guide on the reverse side.
>>
>
> Sorry, typo; I meant: “my UTF-16 MPE”. I added that
> extra row (with the branch excluding the surrogates)
> to gain extra space on the reverse side.
>
> On 2012-12-16, Philippe Verdy wrote:
>
>> Add this missing row. Everything on the reverse side can remain the same
>> (or can use a less "cryptic" compact description of how it works).
>>
>
> I will certainly not change Marco Cimarosti’s original design
> of his UTF-8 MPE.
>
> Best wishes,
> Otto Stolz
Received on Sun Dec 16 2012 - 10:51:35 CST