Re: Unicode native scripting language?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Jan 18 2006 - 09:50:56 CST


    C, C++, C# or Java can indeed handle sources encoded natively in Unicode, and they have the support in their libraries and/or datatype systems to handle Unicode text.
    However, they don't qualify as "scripting languages" because of the compilation step required to produce the binary code that will be run. You may eventually create a special shell script that acts as if it were the running engine, compiling and then running the code (but this may require additional storage for the compiled code files that will be run).
    Lots of languages (today most of them, including historic ones like Cobol and Fortran) have similar capabilities.

    Other true "scripting" languages that work with Unicode is ECMAScript (aka "JavaScript"), and "VBScript" (on Windows).

    But it is important to understand what "native" support means. It is not just the capability of compiling a source program written in some Unicode UTF.
    If you consider C, for example, the "char" datatype will probably not allow you to store every valid Unicode character (defined in terms of a single code point), but the language should allow you to handle strings made of multiple code units (not "characters" in the Unicode sense) representing streams of Unicode code points. With strings you can support any character, even one that requires storing multiple code units. Additionally, even a definition of "characters" would not be enough to correctly support all Unicode algorithms and the language/script-sensitive definitions of what users perceive as "characters".
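
    To make the code unit / code point distinction concrete, here is a minimal sketch in Java (one of the languages cited above; the choice of U+1D11E as the sample character is mine, any supplementary character would do):

        public class CodeUnitsDemo {
            public static void main(String[] args) {
                // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so in
                // Java's UTF-16 strings it occupies two 16-bit code units
                // (a surrogate pair), while still being a single code point.
                String clef = new String(Character.toChars(0x1D11E));

                System.out.println(clef.length());                         // 2 code units
                System.out.println(clef.codePointCount(0, clef.length())); // 1 code point

                // A single 16-bit 'char' cannot hold this character; only the
                // string (a vector of code units) can represent it in full.
                System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
            }
        }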

    So the basic requirement for a Unicode-compatible language is not whether it supports characters, but whether it supports strings handled as vectors of code units, with at least one representation compatible with a Unicode UTF. [The requester asked for the support of strings, so the distinction for characters is not necessary for "Unicode native" languages.]
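
    For instance (a hedged Java sketch; java.nio.charset.StandardCharsets is simply the modern spelling of the UTF-8 charset), a string can be serialized to one of the Unicode UTFs and recovered without loss:

        import java.nio.charset.StandardCharsets;

        public class UtfRoundTrip {
            public static void main(String[] args) {
                // Mixed BMP and supplementary text: "hé " plus U+1D11E.
                String text = "h\u00E9 " + new String(Character.toChars(0x1D11E));

                // Encode the code-unit sequence to UTF-8 bytes ...
                byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
                // ... and decode it back: the round trip is lossless.
                String back = new String(utf8, StandardCharsets.UTF_8);

                System.out.println(text.equals(back)); // true
                System.out.println(utf8.length);       // 8 bytes for 4 code points
            }
        }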

    The most important thing is not the language itself, but the set of functions implementing the standard Unicode transformations and algorithms that it is possible to write in that language. The language itself most often will have no "native" support; instead, that support will be available through a set of libraries accessible from the language.

    A language can be considered to have native support if its standard includes the presence of such a library. By that criterion C99, Perl, and PHP have NO native support (although most implementations today include such a library, sometimes published according to an additional standard), but Java, J#, C#, ECMAScript, and VBScript do have such "native" support.

    Some think that a Unicode-native language means a language that includes support for at least 16-bit code units (16 bits was the maximum when Unicode 1.0 was first published, when the distinction between code units and code points was still fuzzy). I think that's wrong, because it attempts to force the internal representation of strings to be simple arrays of fixed-width code units, when in fact the representations of code units, characters, and strings may be independent of each other (or may collapse in some pairs). For the Unicode standard itself, just supporting the character level is not enough.
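
    One way to see why fixed-width 16-bit code units are not enough (again an illustrative Java sketch, not anything mandated by the standard): a naive per-code-unit string reversal tears surrogate pairs apart, while a code-point-aware traversal does not:

        public class ReverseDemo {
            public static void main(String[] args) {
                String s = "a" + new String(Character.toChars(0x1D11E)) + "b";

                // Naive reversal swaps the 16-bit units blindly and splits the
                // surrogate pair, yielding ill-formed UTF-16.
                char[] u = s.toCharArray();
                StringBuilder bad = new StringBuilder();
                for (int i = u.length - 1; i >= 0; i--) bad.append(u[i]);

                // A code-point-aware reversal keeps each character intact.
                int[] cps = s.codePoints().toArray();
                StringBuilder good = new StringBuilder();
                for (int i = cps.length - 1; i >= 0; i--) good.appendCodePoint(cps[i]);

                System.out.println(s.codePoints().count());               // 3
                System.out.println(bad.toString().codePoints().count());  // 4 (broken pair)
                System.out.println(good.toString().codePoints().count()); // 3
            }
        }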

    Everything in the standard is viewed in terms of text, i.e. strings, even if some parts of the standard are restricted to strings of one character, each made of one code point: the most important algorithms of Unicode (normalization, collation, case mappings, equivalences, etc.) all refer to and require the string level, and code points are just ways to parse the text so that these algorithms can be implemented in a finite number of steps. Likewise, the correspondence between code points and code units is defined through an encoding/decoding process which also requires a finite number of steps.
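
    Java's standard library happens to bundle such string-level algorithms (java.text.Normalizer here; a minimal sketch, and only one of many possible illustrations):

        import java.text.Normalizer;

        public class NormalizationDemo {
            public static void main(String[] args) {
                // Canonically equivalent spellings of "é":
                String composed   = "\u00E9";   // precomposed U+00E9
                String decomposed = "e\u0301";  // 'e' + combining acute accent

                // A code-point comparison sees two different strings ...
                System.out.println(composed.equals(decomposed)); // false

                // ... while normalization, defined on whole strings, unifies them.
                String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
                System.out.println(nfc.equals(composed));        // true
            }
        }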

    Being "native" in a language just means that programming for that language requires a small number of steps for the implementations of these last processes. But you CANNOT design any language that will handle Unicode text using always a simple one-step process for all algorithms, simply because natural languages written in scripts encoded with Unicode are more complex than just the basic algorithms that have been standardized. All you can do is to provide libraries that will implement the standard algorithms in a simple way. Any library based ONLY on the "character" or code unit or codepoint level is necessarily too limited and cannot guarantee full conformance of programs using it to transform Unicode text.

    ----- Original Message -----
    From: "Samuel Thibault" <samuel.thibault@ens-lyon.org>
    To: "Mike Ayers" <mayers@celequest.com>
    Cc: <unicode@unicode.org>
    Sent: Wednesday, January 18, 2006 10:06 AM
    Subject: Re: Unicode native scripting language?

    > Hi,
    >
    > Mike Ayers, on Tue 17 Jan 2006 21:07:44 -0800, wrote:
    >> It's now commonplace for scripting languages to "support" Unicode,
    >> but is there one that is truly fluent in Unicode? I want to be able to:
    (...)
    > C can do this: put yourself in a utf-8 locale, run an editor, type:
    (...)


