Re: javascript and unicode

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue May 27 2003 - 17:49:14 EDT


    From: "Markus Scherer" <markus.scherer@jtcsv.com>
    > Paul Hastings wrote:
    > > would it be correct to say that javascript "natively" supports unicode?
    >
    > ECMAScript, of which JavaScript and JScript are implementations, is defined on 16-bit Unicode
    > scripts and using 16-bit Unicode strings.
    >
    > In other words, the basic encoding support is there, but there are basically no Unicode-specific
    > APIs in the standard. No character properties, no collation that is guaranteed to do more than
    > strcmp, etc. Script writers have to rely on implementation-specific functions or supply their own.

    It would be more correct to say that ECMAScript handles text using the UTF-16 encoding form on most platforms, and so can handle any Unicode character. However, it is true that ECMAScript will let you create invalid Unicode strings, since it allows strings in which surrogate code units are not properly paired.
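    For example, nothing in the language prevents this. A minimal sketch using only standard ECMAScript string functions (the specific character U+1D49C is just an illustration):

        var lone = String.fromCharCode(0xD800);         // a high surrogate with no trailing surrogate
        var pair = String.fromCharCode(0xD835, 0xDC9C); // a valid pair encoding U+1D49C

        lone.length;        // 1: one code unit, but not a well-formed Unicode string
        pair.length;        // 2: one code point, but two code units
        pair.charCodeAt(0); // 0xD835: charCodeAt() indexes code units, not code points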

    This says nothing about the internal encoding of strings within ECMAScript engines: an engine could just as well use CESU-8 internally, but that internal encoding remains hidden.

    So the situation of ECMAScript is exactly analogous to that of Java (in which the built-in type "char" is an unsigned 16-bit integer, and the String type is handled in terms of "char" code units using UTF-16). However, the serialization of compiled Java classes internally encodes these strings with UTF-8 (in fact a slightly modified UTF-8), which is deserialized back to UTF-16 when the class is loaded.

    You will find a similar situation on Windows with the Win32 API, and in its C/C++ binding using TCHAR (and the _T() macro for string literals) with the _UNICODE compile-time define, or on any system where the ANSI C type wchar_t is defined as a 16-bit integer.

    Note that we are speaking here about code units, not code points. Code units, not code points, are what programming languages use to handle strings. As code units are well defined in Unicode in relation to an encoding form, any language or system can be made fully Unicode-compliant, provided it also offers library functions for string handling that implement the Unicode-defined algorithms (which are described in terms of code points).
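    As a rough sketch of the kind of helper such a library has to provide, here is a loop that walks a string by code point, pairing surrogates on top of the code-unit API the language actually exposes (the function name is purely illustrative):

        function forEachCodePoint(s, callback) {
            for (var i = 0; i < s.length; ) {
                var cu = s.charCodeAt(i);
                var cp = cu;
                var width = 1;
                // A high surrogate followed by a low surrogate combines into one code point.
                if (cu >= 0xD800 && cu <= 0xDBFF && i + 1 < s.length) {
                    var next = s.charCodeAt(i + 1);
                    if (next >= 0xDC00 && next <= 0xDFFF) {
                        cp = 0x10000 + ((cu - 0xD800) << 10) + (next - 0xDC00);
                        width = 2;
                    }
                }
                callback(cp, i);
                i += width;
            }
        }

        // "A" followed by U+1D49C: three code units, but only two code points reported.
        forEachCodePoint("A\uD835\uDC9C", function (cp, index) {
            // first call:  cp = 0x41,    index = 0
            // second call: cp = 0x1D49C, index = 1
        });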

    It's up to the library (not the language) to implement, on top of code units, the standard algorithms that are defined in terms of code points. Of course it is much easier to implement these algorithms with 16-bit code units than with 8-bit code units. But the language itself has no other special Unicode-compliance characteristics.
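    A small illustration of the "no more than strcmp" point quoted above: the relational operators compare strings by 16-bit code unit values, so anything closer to real collation has to come from the implementation or an extra library (the behaviour of localeCompare() below is implementation-defined):

        "zebra" < "\u00E9clair";              // true: 'z' (0x7A) < 'é' (0xE9), plain code-unit order
        "zebra".localeCompare("\u00E9clair"); // typically positive: a locale-aware compare sorts 'é' near 'e'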


