Re: Support for non-BMP characters

From: Jukka K. Korpela <jkorpela_at_cs.tut.fi>
Date: Wed, 25 Apr 2012 14:50:52 +0300

2012-04-25 12:09, Marc Durdin wrote:

> Probably the most egregious example I know of is JavaScript.
> As far as I know, JavaScript still only groks UCS-2. I'd love to be
> wrong.

The ECMAScript standard neither requires nor forbids support for non-BMP
characters: “A conforming implementation of this Standard shall
interpret characters in conformance with the Unicode Standard, Version
3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the
adopted encoding form, implementation level 3. If the adopted ISO/IEC
10646-1 subset is not otherwise specified, it is presumed to be the BMP
subset, collection 300. If the adopted encoding form is not otherwise
specified, it presumed to be the UTF-16 encoding form.”
   http://www.ecma-international.org/publications/standards/Ecma-262.htm

In practice, modern implementations support UTF-16 and the full Unicode
coding space. There are of course problems with fonts, the native
“character” type is still a 16-bit code unit, and things are generally
clumsy, but it does work. You can even have non-BMP characters directly
as data in a UTF-8 encoded HTML document. When you access such data in
client-side JavaScript, the browser will have internally converted the
data to UTF-16, so the JavaScript code sees a non-BMP character as two
code units, or “JavaScript characters.”

Demo:

<!doctype html>
<meta charset=utf-8>
<p id=p>&#x1D64F;</p>
<script>
var s = document.getElementById('p').innerHTML;
document.write(s.charCodeAt(0));
document.write(', ');
document.write(s.charCodeAt(1));
</script>

In modern browsers, this displays U+1D64F (MATHEMATICAL SANS-SERIF BOLD
ITALIC CAPITAL T) and then the two numbers 55349 and 56911 (0xD835 and
0xDE4F), which are the high and low surrogate code units of its UTF-16
encoded representation.
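
The two numbers can be checked by hand with the UTF-16 surrogate
arithmetic. The following is only a minimal sketch; the function names
toSurrogates and fromSurrogates are made up for this illustration and
are not part of any standard API.

```javascript
// Hypothetical helpers illustrating UTF-16 surrogate arithmetic.

// Split a code point above U+FFFF into its high and low surrogates.
function toSurrogates(codePoint) {
  var offset = codePoint - 0x10000;      // 20-bit value
  var high = 0xD800 + (offset >> 10);    // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

// Recombine a surrogate pair into the original code point.
function fromSurrogates(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

var pair = toSurrogates(0x1D64F);
var back = fromSurrogates(pair[0], pair[1]);
```

For U+1D64F, pair holds 55349 and 56911, and back is 0x1D64F again.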

You could even use the non-BMP character as such in a JavaScript literal
in a UTF-8 encoded document, like s = '𝙏'. Though support is not
required, modern browsers deal with this. Technically, a JavaScript
literal consists of code units, but non-BMP characters just generate two
code units each.
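
If you would rather keep the source ASCII-only, the same string can
presumably be written with one \u escape per code unit. A small sketch
(the variable names are arbitrary) that also shows how the length and
the code units come out:

```javascript
// U+1D64F written as two \u escapes, one per UTF-16 code unit.
var s = '\uD835\uDE4F';

// Collect the code units the same way the demo does with charCodeAt.
var units = [];
for (var i = 0; i < s.length; i++) {
  units.push(s.charCodeAt(i));
}
// s.length counts code units, not characters, so it is 2 here.
```

This is the same pair of values that the demo reads from the HTML
document, so the two ways of writing the character should be
interchangeable in script code.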

Yucca
Received on Wed Apr 25 2012 - 06:54:48 CDT
