Re: Encoding converion through JDBC

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Jun 04 2003 - 11:41:52 EDT

  • Next message: Marion Gunn: "Re: Address of ISO 3166 mailing list"

    First I did not use the term "SQL Server", but "RDBMS". We spoke about JDBC which is made to be independant of any RDBMS (SQL Server included, which you should better designate as "Microsoft SQL Server", because Oracle also is, among other things, a SQL server).

    Also, UCS-2 is officially dead! You won't find any use of it now in ISO/IEC 10646 (where it was first used, and has now been replaced by "UCS-4" in some ISO/IEC 10646 documents, and by the generic term "code point" in both Unicode and newer 10646 documents).

    Don't speak about UCS-2! Microsoft SQL Server does not (and will no more) support it. Instead refer to UTF-16 encoding form or to UTF-16* encoding schemes, or to codepoints in the BMP, which are less ambiguous!

    Whatever your think, UCS-2 just means all the valid characters in the BMP, and it defines no definition in term of bits for its storage (so an UCS-2 character can be handled in a UTF-32 encoding form or stored/transmitted in one of the three UTF-32* encoding schemes). This is just a character set made of a bijective association between some astract characters (with defined properties) and an integer (in a limited range of values).

    It's true that up to Unicode 3.1, all characters where only assigned in UCS-2 only, but the BMP also contains since long code points assigned to surrogates, and the fact that RDBMS systems could be used to store unpaired surrogates is just a legacy history where such check was relaxed (simply because surrogates were still not needed). Some non-conforming applications have used these surrogates as if they were PUA without encoding restrictions. Such applications need to be upgraded to use new PUA allocated in planes 15 and 16 with correctly paired surrogates. This supposes a database migration if they upgrade their server software, and possibly some new disk space to remap these illegal uses, or the application and database must be kept to only accept a past version of Unicode...

    All RDBMS systems now will be enforcing the correct pairing of surrogates, even if they continue to use UTF-16 for their storage. But from the JDBC view, this is coherent with what Java does, because the "String" and class and native "char" type or "Character" class currently only allow storing 16-bits per characters. This may change soon, notably on 64 bit platforms (look precisely in the Java documentation, the size of a Java "char" is not explicitly specified and it could be larger, including in the bytecode instructions !

    Some Java classes that assume that the "char" arithmetic will automatically roll after 16 bits are wrong. The JVM spec only requires that char be at least 16-bit wide (but it may be larger). The compiled classes need to store string constants. But these constants are serialized to be platform independant using a UTF-8 encoding scheme.

    Your sentence:
        the "N" prefix fields *always* mean UCS-2 for MSSQLS, period.
    is wrong notably from the JDBC view, simply because we don't care about physical types on the server or networking protocol or middleware, but about JDBC datatypes. Look at Sybase (from which MSSQLS inherited its client library, its SQL language and syntax, and list of supported datatypes), you'll see that Sybase correctly handles ALL Unicode characters with NVARCHAR simply because it correctly uses the UTF-16 encoding and checks it in all its conversion libraries.

    MSSQLServer will also perform these checks when converting a client encoding to storage, because it internally uses the MultiByteToWideChar() Win32 API (and its reversed API) to perform this job. These APIs *will* throw exceptions that the SQL Server will use to detect invalid UTF-16 sequences, if the server is tuned to do so, or if the client does not use UTF-16 but another encoding, or if the client uses an UTF-8 encoding with sequences of more that 3 bytes on a server or database that was not marked as being Unicode 3.0+ compatible. Look at the notices for Chinese users which most often use a codepage-9xx environment on Windows, and how Microsoft made its system and servers comply with the required support of GB18030 standard since 2000...

    You should also look at upgrade notices in MSSQLServer releases. Your assumption are wrong if you just consider the relaxed encoding rules permitted in previous versions. For migration purpose, there exists tools that will check if a old database can be marked to support a more recent version of Unicode, and that will help a dbadmin diagnose the content of tables and SQL stored procs or triggers that need updates. If all tests pass, the database will be marked to the new version and it will be possible to clients to use a higher version of Unicode without experimenting conversion problems or alerts...

    Form the JDBC point of view, these are driver-specific details. All that is needed in JDBC is to map a RDBMS type to some known Java classes, and be able to use instances of these types when "discussing" with a RDBMS, and handling possible conversion errors by mapping them to Java exceptions.

    So use the Java String class type (which really is UTF-16 for now, but may be extended to UTF-32 on future 64-bit platforms), and ignore the other encoding details as this job is part of each JDBC driver (for now all of them will build String instances using UTF-16, but it will probably soon be possible to indicate to the JDBC driver the preference to UTF-32 in order to remove the handling of surrogates in the Java application).

    Sun is discussing a lot the best way to standardize a common API for the full support of Unicode 3.2+, with a coherent set of methods for Strings, in a way that would reak the lowest number of applications (including those that made false assumptions about the size of a native "char"; some existing String APIs will be officially deprecated because they return a "char" instead of a int, or because they return indexes interms of code units and not in terms of code points).

    The probable official full support of Unicode 4 and 3.2 will come with new classes derived from Character and String (UChar and UString are their name in the IBM ICU package, but Sun may also keep the class name but designate them under the java.text package insteads of the core's java.lang package, and a compiler option (such as the target Java version) may allow a class author to compile its code according to the default java.lang.String or java.text.String class if the package name is not specified by an explicit import).

    -- Philippe.
    ----- Original Message -----
    From: "Michael (michka) Kaplan" <michka@trigeminal.com>
    To: "Philippe Verdy" <verdy_p@wanadoo.fr>
    Sent: Wednesday, June 04, 2003 4:36 PM
    Subject: Re: Encoding converion through JDBC

    > From: "Philippe Verdy" <verdy_p@wanadoo.fr>
    >
    > Phillipe, you went on for quite a while and I admit most of the things you
    > talked about are not thing about which I have knowledge. But some of the
    > things you talked about, I do understand, and in those cases you were wrong.
    > Psychologically, it causes me to wonder how much of the rest of this message
    > converys accurate information.
    >
    > Specifically, you talk about SQL Server but most of what you said about it
    > is inaccurate. You cannot stored big endian data without risking corruptipn,
    > you can only store UCS-2, it is not surrogate aware can can thus be said to
    > truly support onlu UCS-2, not UTF-16, and the "N" prefix fields *always*
    > mean UCS-2 for MSSQLS, period.
    >
    > You have a gift -- that of being able to speak knowledgably. But please, use
    > that gift for *good* and do not move past what you know.
    >
    > Please, think about it?
    >
    > MichKa
    >



    This archive was generated by hypermail 2.1.5 : Wed Jun 04 2003 - 12:34:15 EDT