Re: javascript and unicode

From: Mark Davis (mark.davis@jtcsv.com)
Date: Tue May 27 2003 - 18:18:24 EDT


    One minor correction:

    > However, it's true that ECMAScript will allow you to create invalid
    > Unicode strings.

    More precisely, ECMAScript (and other systems) will allow you to
    create 16-bit Unicode strings that are not UTF-16.

    See Section 2.7 in http://www.unicode.org/book/preview/ch02.pdf.
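
    For instance (a minimal sketch; the variable name is only
    illustrative, and the calls are standard ECMAScript):

        // A lone high surrogate: a perfectly legal ECMAScript string
        // value, but not a well-formed UTF-16 sequence.
        var lone = String.fromCharCode(0xD800);   // same as "\uD800"
        lone.length;                              // 1 (one 16-bit code unit)
        lone.charCodeAt(0).toString(16);          // "d800" -- an unpaired surrogate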

    Mark
    __________________________________
    http://www.macchiato.com
    ► “Eppur si muove” ◄

    ----- Original Message -----
    From: "Philippe Verdy" <verdy_p@wanadoo.fr>
    To: <unicode@unicode.org>
    Sent: Tuesday, May 27, 2003 14:49
    Subject: Re: javascript and unicode

    > From: "Markus Scherer" <markus.scherer@jtcsv.com>
    > > Paul Hastings wrote:
    > > > would it be correct to say that javascript "natively" supports
    > > > unicode?
    > >
    > > ECMAScript, of which JavaScript and JScript are implementations,
    > > is defined on 16-bit Unicode scripts and uses 16-bit Unicode
    > > strings.
    > >
    > > In other words, the basic encoding support is there, but there are
    > > basically no Unicode-specific APIs in the standard: no character
    > > properties, no collation that is guaranteed to do more than strcmp,
    > > etc. Script writers have to rely on implementation-specific
    > > functions or supply their own.
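    > >
    > > For example (a sketch; the literals are only illustrative):
    > >
    > >     // Default string comparison is by 16-bit code unit values,
    > >     // much like strcmp, not by language-sensitive collation:
    > >     "a" < "B";              // false: code unit 0x61 > 0x42
    > >     "résumé" < "resume";    // false: 0xE9 > 0x65
    > >     // localeCompare exists, but its result is implementation-defined.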
    >
    > It would be more correct to say that ECMAScript handles text using
    > the UTF-16 encoding form on most platforms, and so can handle any
    > Unicode character. However, it's true that ECMAScript will allow you
    > to create invalid Unicode strings, as it allows you to create strings
    > in which surrogate code units do not pair.
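    >
    > For illustration (a minimal sketch; the variable names are only
    > examples):
    >
    >     // U+10000 is stored as a surrogate pair of two 16-bit code units:
    >     var s = "\uD800\uDC00";        // one supplementary character
    >     s.length;                      // 2 -- counted in code units
    >
    >     // Taking half of the pair silently yields an unpaired surrogate:
    >     var broken = s.substring(0, 1);    // "\uD800", not valid UTF-16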
    >
    > This says nothing about the internal encoding of strings within
    > ECMAScript engines: an engine could just as well use CESU-8
    > internally, but that internal encoding would be hidden.
    >
    > So the situation of ECMAScript is exactly the same as in Java (where
    > the built-in type "char" is an unsigned 16-bit integer, and the
    > String type is handled in terms of "char" code units as UTF-16).
    > However, the compiled class file serializes these strings in a
    > modified UTF-8, which is decoded back to UTF-16 when the class is
    > loaded.
    >
    > The situation is similar on Windows with the Win32 API, and in its
    > C/C++ binding using TCHAR (and the _T() macro for string constants)
    > with the _UNICODE compile-time define, or on any system where the
    > ANSI C type wchar_t is defined as a 16-bit integer.
    >
    > Note that we are speaking here about code units, not code points.
    > Code units, not code points, are what programming languages use to
    > handle strings. Since code units are well defined in Unicode in
    > relation to an encoding form, any language or system can be made
    > compliant and fully support Unicode, provided it also supplies
    > library functions for string handling that implement the
    > Unicode-defined algorithms (which are described in terms of code
    > points).
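    >
    > For example (a sketch; the sample string is arbitrary):
    >
    >     // ECMAScript strings are indexed and measured in 16-bit code units:
    >     var s = "A\uD835\uDC9C";   // "A" followed by U+1D49C as a surrogate pair
    >     s.length;                  // 3 code units, although only 2 code points
    >     s.charCodeAt(1);           // 0xD835 -- a code unit, not a code point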
    >
    > It's up to the library (not the language) to make its
    > code-unit-based implementation of Unicode comply with the standard
    > algorithms, which are based on code points. Of course it is much
    > easier to implement these algorithms with 16-bit code units than
    > with 8-bit code units, but the language itself has no other special
    > Unicode-compliance characteristics.
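    >
    > As an illustration of such a library layer (a minimal sketch;
    > codePointAtIndex is a hypothetical helper, not a built-in function):
    >
    >     // Return the code point starting at a given code-unit index,
    >     // combining a well-formed surrogate pair into one code point.
    >     function codePointAtIndex(str, i) {
    >         var hi = str.charCodeAt(i);
    >         if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
    >             var lo = str.charCodeAt(i + 1);
    >             if (lo >= 0xDC00 && lo <= 0xDFFF) {
    >                 return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    >             }
    >         }
    >         return hi;   // a BMP code point, or an unpaired surrogate
    >     }
    >
    >     codePointAtIndex("\uD835\uDC9C", 0).toString(16);   // "1d49c"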
    >
    >


