RE: Wrong UTF-8 encoders still around? from Shawn Steele on 2011-10-20 (Unicode Mail List Archive)

From: Shawn Steele <Shawn.Steele_at_microsoft.com>
Date: Thu, 20 Oct 2011 23:28:41 +0000

Define "still around" :) Old software never dies... it just hangs around to make compatibility problems for a new generation.

-----Original Message-----
From: unicode-bounce_at_unicode.org [mailto:unicode-bounce_at_unicode.org] On Behalf Of "Martin J. Dürst"
Sent: Thursday, October 20, 2011 4:00 PM
To: Unicode Mailing List
Cc: Larry Masinter
Subject: Wrong UTF-8 encoders still around?

I'm hoping to get some advice from people with experience with various Unicode/transcoding libraries.

RFC 3987 (the current IRI spec) has the following text:

    Note: Some older software transcoding to UTF-8 may produce illegal
       output for some input, in particular for characters outside the
       BMP (Basic Multilingual Plane). As an example, for the IRI with
       non-BMP characters (in XML Notation):
       "http://example.com/&#x10300;&#x10301;&#x10302;"
       which contains the first three letters of the Old Italic alphabet,
       the correct conversion to a URI is
       "http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82"

We are thinking about removing this note because we hope that software has improved in the meantime, but we would like to be sure before doing so.

If anybody knows about software out there that still presents this problem, please tell us.
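
For anyone who wants to check a library against the RFC example, here is a minimal Python sketch. The RFC note does not say what the illegal output looks like; the sketch assumes one well-known failure, where an encoder working on UTF-16 code units turns each surrogate into its own 3-byte sequence (CESU-8 style). The function names are made up for illustration and do not come from any library.

```python
from urllib.parse import quote

# First three Old Italic letters, U+10300..U+10302 (the RFC 3987 example).
PATH = "\U00010300\U00010301\U00010302"


def correct_uri(s: str) -> str:
    # Proper UTF-8: each non-BMP character becomes one 4-byte sequence,
    # which is then percent-encoded byte by byte.
    return "http://example.com/" + quote(s.encode("utf-8"))


def surrogate_pair_bytes(s: str) -> bytes:
    # Hypothetical broken encoder (an assumption, not taken from the RFC):
    # split the text into UTF-16 code units and encode every unit on its
    # own, so each surrogate ends up as a separate 3-byte sequence.
    utf16 = s.encode("utf-16-be")
    units = [int.from_bytes(utf16[i:i + 2], "big")
             for i in range(0, len(utf16), 2)]
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


def is_legal_utf8(data: bytes) -> bool:
    # Python's strict UTF-8 decoder rejects encoded surrogates.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(correct_uri(PATH))
print(quote(surrogate_pair_bytes(PATH)))
print(is_legal_utf8(PATH.encode("utf-8")), is_legal_utf8(surrogate_pair_bytes(PATH)))
```

The first line printed should match the URI given in the RFC text; the second shows what the assumed broken encoder would produce for the same input, which a strict decoder rejects.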

Thanks, Martin.
Received on Thu Oct 20 2011 - 18:32:03 CDT
