RE: Support for non-BMP characters

From: Marc Durdin <marc.durdin_at_tavultesoft.com>
Date: Wed, 25 Apr 2012 09:41:40 +0000

Yes, but this means that regexes with SMP characters don’t work (e.g. [𝒜-𝒵]), character counts return code units, etc. So you have to reimplement string.length, string.charCodeAt, etc., if you don’t want to deal with surrogate pairs (I reckon you’ve got better things to be spending your time on).
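To make the code-unit point concrete, here is a minimal sketch in Python (used only as a neutral calculator, not as a claim about any JavaScript engine's internals). The helper names utf16_units and code_point_count are made up for this illustration. It shows that 𝒜 (U+1D49C) occupies two UTF-16 code units, which is the count a code-unit-based length would report, and that the character class [𝒜-𝒵] seen as code units contains a range running from a low surrogate to a high surrogate, i.e. backwards.

```python
import struct

def utf16_units(s: str) -> list[int]:
    """Return the UTF-16 code units of s (hypothetical helper for this sketch)."""
    data = s.encode("utf-16-le")
    return [unit for (unit,) in struct.iter_unpack("<H", data)]

def code_point_count(s: str) -> int:
    """Python 3 strings are indexed by code point, so len() counts code points."""
    return len(s)

script_a = "\U0001D49C"  # 𝒜 MATHEMATICAL SCRIPT CAPITAL A (non-BMP)
script_z = "\U0001D4B5"  # 𝒵 MATHEMATICAL SCRIPT CAPITAL Z (non-BMP)

# One character, but two code units: a code-unit length would say 2.
print(code_point_count(script_a), len(utf16_units(script_a)))

# Seen as code units, "[𝒜-𝒵]" is: D835, then the range DC9C-D835, then DCB5.
# That middle range runs from a larger value to a smaller one.
units = utf16_units(script_a + "-" + script_z)
print([hex(u) for u in units])
```

A code-point-aware regex engine treats 𝒜-𝒵 as a single range of 26 code points instead.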

http://dheeb.files.wordpress.com/2011/07/gbu.pdf “Unicode Support Shootout - The Good, the Bad & the (mostly) Ugly” by Tom Christiansen has a great summary of some of the issues with relying on JavaScript’s internal string manipulation (unfortunately can’t find a better working link at present – the official training.perl.com site seems to be down). Actually, that presentation is a fantastic place to start for understanding many of the limitations of various programming languages’ support for Unicode – if you haven’t read it, I’d urge you to go read it now.

Marc

From: Szelp, A. Sz. [mailto:a.sz.szelp_at_gmail.com]
Sent: Wednesday, 25 April 2012 7:28 PM
To: Marc Durdin
Cc: David Starner; Unicode Mailing List
Subject: Re: Support for non-BMP characters

Shouldn't it be technically possible to store Supplementary Plane characters in UTF-16 / UCS-2 as well? Isn't that what Surrogate Pairs are for?
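For reference, the surrogate pair mapping defined by UTF-16 can be sketched as below. This is only an illustration in Python; the function names to_surrogates and from_surrogates are invented for the example. UCS-2 proper has no such mechanism, so software that only understands UCS-2 will store the two code units but treat them as two separate "characters".

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into (high, low)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000             # 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits -> D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> DC00..DFFF
    return high, low

def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x1D49C)    # 𝒜 MATHEMATICAL SCRIPT CAPITAL A
print(hex(high), hex(low))
print(hex(from_surrogates(high, low)))
```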

Sz
On Wed, Apr 25, 2012 at 11:09, Marc Durdin <marc.durdin_at_tavultesoft.com> wrote:
Probably the most egregious example I know of is JavaScript. As far as I know, JavaScript still only groks UCS-2. I'd love to be wrong.

Marc

-----Original Message-----
From: unicode-bounce_at_unicode.org [mailto:unicode-bounce_at_unicode.org] On Behalf Of David Starner
Sent: Wednesday, 25 April 2012 6:32 PM
To: Unicode Mailing List
Subject: Support for non-BMP characters

It's been ten years since the first non-BMP characters were encoded. How are they working in your neck of the woods? There's a lot of places where they're working just fine, but I was facing MySQL's support. It has had support for UCS-2 and UTF-8 limited to the BMP for a long time; now in MySQL 5.5 there's utf16, utf32 and utf8mb4. (MySQL 5.1 and 5.5 are the current stable releases.) But there's enough warnings about incompatibilities with utf8mb4 to make me pause before switching my private database to it, and I think the net will see MySQL databases with utf8 instead of utf8mb4 as long as MySQL exists, unless they decide to push people over to it.
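The utf8 versus utf8mb4 distinction comes down to UTF-8 sequence length: BMP characters need at most three bytes, while non-BMP characters always need four. A small Python sketch of that check follows. The function name fits_three_byte_utf8 is hypothetical, and the code only measures encoded lengths; it does not talk to MySQL or predict how a given server version reacts to such data.

```python
def fits_three_byte_utf8(s: str) -> bool:
    """True if every character of s encodes to at most 3 bytes in UTF-8,
    i.e. the string contains only BMP characters."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in s)

bmp_text = "Kie ekzistas vivo \u20ac"   # ASCII plus EURO SIGN (3 bytes in UTF-8)
smp_text = "\U0001D49C"                 # 𝒜, a non-BMP character (4 bytes)

print(fits_three_byte_utf8(bmp_text))
print(fits_three_byte_utf8(smp_text))
print(len(smp_text.encode("utf-8")))
```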

(Ada's an issue too, though not one most people will have to deal with. While Ada 2005 added a UTF-32 string type, it left the UCS-2 string type as is. Again, I suspect a lot of nominally Unicode Ada programs are going to be BMP-only. Of course, UTF-8 as an ASCII superset is used, stuffed into strings labeled Latin-1; it's technically not conformant with the Ada standard but it works so long as you don't need much string processing.)

In any case, is the use of non-BMP characters still problematic in your corner of the computing world or is everything looking fine from where you are?

--
Kie ekzistas vivo, ekzistas espero.




Received on Wed Apr 25 2012 - 04:44:04 CDT
