Re: UTF-8 ill-formed question

From: Philippe Verdy <>
Date: Sun, 16 Dec 2012 21:15:44 +0100

But Marco's old design from that time (2002) still ignored the Unicode
UTF-8 conformance constraints, as shown by its use of the obsolete
"U-00nnnnn" notation (matching the obsolete ISO/IETF definition). If this
pocket conversion card is meant to be used as a tutorial, omitting the
validity constraints is not very didactic: if these rules are not shown
and learnt, they will be ignored in applications, and that will keep
causing compatibility problems.
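
To make the point concrete, here is a minimal sketch in Python of a
conversion that does apply the validity constraints. The function name
utf8_encode is only illustrative and is not part of Marco's card; the two
checks (no surrogates, nothing above U+10FFFF) are the ones the current
UTF-8 definition requires.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8, rejecting values that are not
    Unicode scalar values (illustrative helper, not from the card)."""
    # Constraint 1: the code space ends at U+10FFFF.
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"out of range: {cp:#x}")
    # Constraint 2: surrogates (U+D800..U+DFFF) must not be encoded.
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"surrogate code point: U+{cp:04X}")

    if cp < 0x80:          # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

For example, utf8_encode(0x20AC) gives the bytes E2 82 AC, while
utf8_encode(0xD800) raises ValueError instead of silently producing an
ill-formed sequence.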

Note that in my previous post, I dropped the extra leading zeroes from
Marco's obsolete "U-00nnnnn" notation for supplementary code points, but
I forgot to change the "U-" prefix to "U+" for them. Sorry about that.
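
As a small illustration of the notation fix, the sketch below rewrites an
old-style label into the current "U+" form, which uses four to six hex
digits and keeps leading zeroes only to pad to four digits. The helper
name to_u_plus and the accepted input pattern are assumptions made for
this example.

```python
import re

def to_u_plus(label: str) -> str:
    """Rewrite "U-000nnnnn" or "U+nnnn" style labels as modern "U+" notation
    (hypothetical helper, for illustration only)."""
    m = re.fullmatch(r"U[-+]([0-9A-Fa-f]{1,8})", label)
    if m is None:
        raise ValueError(f"not a code point label: {label!r}")
    cp = int(m.group(1), 16)
    if cp > 0x10FFFF:
        raise ValueError(f"beyond U+10FFFF: {label!r}")
    # At least 4 hex digits; no extra leading zeroes beyond that.
    return f"U+{cp:04X}"
```

With this, to_u_plus("U-00010400") returns "U+10400" and
to_u_plus("U-00000041") returns "U+0041".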

Of course there are better ways to present this card if it is going to be
printed (then placed under a reusable plastic cover, like an identity card
or driver's licence, at the size of a credit card so that it fits in your
jacket): use HTML or PDF instead of this basic plain-text format. The usage
instructions on the back would also be clearer, and additional visual
hints would make them more obvious. You would also be less restricted when
drawing the diagram, with no need for the ugly box-drawing characters
(which only work with monospaced fonts, themselves ugly for presenting the
instructions). The pocket card could also use background colors to
distinguish the fixed parts of the layout from the all-white boxes where
you need to write something (better than using a dot).

There are also other possible presentations if a similar tool is printed
on cardboard: just use rotating wheels (one for VW, one for X, one for Y;
you may omit the Z wheel, which would display the same value in the input
and output windows) behind a cardboard front mask with windows showing the
input and the result of the conversion. You don't need a pen, and it is
reusable, simpler and faster to use.

2012/12/16 Doug Ewell <>

> I remember Marco's original post in 2002. His intent was to give people
> with an actual U+ code point that needed converting—like James Lin ten
> years later—a quick way to do so without getting immersed in all the
> bit-shifting math.
> If this were a routine being run by a computer, or a tutorial on UTF-8, I
> would agree that it should have taken loose surrogates into account. But
> it's not. It's just a quick manual reference guide, and loose surrogates
> are 0.0001% of the real-world problem for users like James.
> While I note that Philippe's amended version seems straightforward and in
> keeping with Marco's original intent (short and simple), I'd like to
> suggest that neither Marco for creating the original guide, nor anyone else
> for doing up UTF-16 and UTF-32 versions, nor Otto for reposting them on the
> list this week, need to be beaten up any further over this edge case.
> --
> Doug Ewell | Thornton, Colorado, USA
> | @DougEwell
Received on Sun Dec 16 2012 - 14:19:03 CST
