Re: Wrong UTF-8 encoders still around?

From: Bjoern Hoehrmann <derhoermi_at_gmx.net>
Date: Fri, 21 Oct 2011 01:41:53 +0200

* Martin J. Dürst wrote:
>I'm hoping to get some advice from people with experience with various
>Unicode/transcoding libraries.
>
>RFC 3987 (the current IRI spec) has the following text:
>
> Note: Some older software transcoding to UTF-8 may produce illegal
> output for some input, in particular for characters outside the
> BMP (Basic Multilingual Plane). As an example, for the IRI with
> non-BMP characters (in XML Notation):
> "http://example.com/&#x10300;&#x10301;&#x10302;"
> which contains the first three letters of the Old Italic alphabet,
> the correct conversion to a URI is
> "http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82"

MySQL versions that do not support non-BMP characters in the native
UTF-8 type are probably still around, as I recall the history of that,
and I have recently encountered plenty of encoders that do not support
non-BMP characters either (JavaScript SHA-1 implementations come to
mind). But a general-purpose encoder with this problem that people
might reasonably use with IRI software (perhaps because there is no
simple alternative) seems a bit of a stretch.

I do think it would nevertheless be useful to have such an example in
the specification, as people tend to test their code using examples in
the specification if there is no other test suite immediately
available. It should, however, be in some "examples" or "test cases"
section, akin to section 5.4.1 in RFC 3986, without the commentary in
the note you've quoted.
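
To illustrate what such a test case would check, here is a minimal
sketch in Python. It is not taken from any specification. The function
names are hypothetical, and the "broken" variant is only my assumption
about how an encoder goes wrong when it handles each UTF-16 surrogate
separately instead of the code point as a whole.

```python
from urllib.parse import quote


def iri_segment_to_uri(segment: str) -> str:
    """Percent-encode a path segment via UTF-8 (hypothetical helper).

    urllib.parse.quote encodes non-ASCII characters as UTF-8 by default.
    """
    return quote(segment, safe="/")


def broken_surrogate_encode(segment: str) -> str:
    """Illustrative WRONG encoder: splits each non-BMP character into a
    UTF-16 surrogate pair and encodes each surrogate as three bytes.
    ASCII is passed through unchanged to keep the sketch short.
    """
    out = []
    for ch in segment:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
            continue
        if cp >= 0x10000:
            v = cp - 0x10000
            units = [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
        else:
            units = [cp]
        for u in units:
            if u < 0x800:
                encoded = [0xC0 | (u >> 6), 0x80 | (u & 0x3F)]
            else:
                encoded = [0xE0 | (u >> 12),
                           0x80 | ((u >> 6) & 0x3F),
                           0x80 | (u & 0x3F)]
            out.append("".join(f"%{b:02X}" for b in encoded))
    return "".join(out)


# First three letters of the Old Italic alphabet, U+10300..U+10302.
old_italic = "\U00010300\U00010301\U00010302"
correct = "http://example.com/" + iri_segment_to_uri(old_italic)
broken = "http://example.com/" + broken_surrogate_encode(old_italic)
```

Comparing `correct` against the URI given in the quoted note is the
kind of check an implementer could run; `broken` shows the three-byte
sequences starting with %ED that a surrogate-based encoder would emit
instead.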

-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
Received on Thu Oct 20 2011 - 18:44:58 CDT
