Re: Encoding converion through JDBC

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Jun 04 2003 - 10:04:13 EDT

  • Next message: Philippe Verdy: "Re: Encoding converion through JDBC"

    In all major databases, the native encodings of the OS, of the database when it was created, of the networking protocol, of the SQL queries and results, and of the client application are all independant.

    When JDBC connects to a database, it gets a lot of environment information from the server (true for Oracle, Sybase, MSSQL, Informix, DB2) and the client library manages the conversion to the best encoding supported by both the client and the server in order to minimize transcoding fallbacks. As such, the networking protocol is not always using an Unicode encoding scheme, but may use the same encoding for the client and server, if both support this encoding.

    Encoding negociation if part of the RDBMS-specific networking protocol (SQL*Net or SQL*NetV2 for Oracle, TDS for Sybase and MSSQL, IFX-Net for Informix, ... I'm not sure about other RDBMS engines and protocols). These protocols exist since long even before Unicode and ISO10646 were standardized, at a time where the management of multiple encodings was really needed. Unicode encoding forms are just supplementary options for this negociation.

    Generally the solution adopted will be to use the needed conversion transparently, but it supposes some capability of the server table's to store these characters, and some support for the client to limit fallbacks (depending on the RDBMS, such fallbacks may be done without any explicit indication to the client application, or it can produce SQL errors, or the application can specify the behavior using triggers either in the client library or through administrative SQL commands for the session.

    I's not JDBC iself hat performs these adaptations: most of the work is performed directly in the native client library (preferably as it scales better and offers more control in the application) of the specific RDBMS engine, with some conversions sometimes accepted by the server (but most servers will just raise and exception or SQL error that will invalidate the transaction if a network encoding is no conerible to the encoding of the storage tables).

    Note also: you use the term "UCS-2" inaccurately. NVARCHARs are storing code units, not code points. UCS-2 is designating only the subrange of valid characters in the BMP. In fact, you should have said that NVARCHARs are just defined to allow storing international characters. Depending on the way the database was created, this may store strings as UTF-8 or UTF-16.

    If the database is portable and architecture independant, the native byte order of the running platform may not be decisive, bu instead will be specified by a global flag stored in the table descriptor or in the master catalog of the database (so that a database can be closed and its storage remounted to another server without requiring a lengthy import).

    For Oracle and Sybase or MSSQL, NVARCHARs are dababase types, but they may be constrained or tuned during the database creation, so that code units will be 8 bit or 16 bit. This however does not change the length constraint, which is always considered in terms of logical characters (and checked only like run-time triggers) and not in terms of absolute byte length (so NVARCHARs and even NCHARs are stored as varying length byte sequences).

    So N(VAR)CHARs and (VAR)CHARs only differ logically by the fact that the set of allowed characters for N(VAR)CHARs must *include* the set of allowed characters for (VAR)CHARs (and thus *may* be larger). Both sets may or may not be equal and each one may have a distinct encoding in terms of code units.

    There is no guarantee that NVARCHAR will always accept all Unicode characters (this condition can be restricted by database creation constraints or tuning), and a database may be created so that it will accept the whole Unicode character set in CHARs or VARCHARs (in that case NCHARs or NVARCHARs will be most often equivalent in terms of processing and storage).

    The main interest of N(VAR)CHARs (generally implemented to support all Unicode characters) is that it allows fine tuning of the storage requirements by separating strings that are not designed to be translated (such as referencial keys) and those that will store localizable data (including secondary search keys).

    The length constraint on N(VAR)CHAR is just adding processing time, because such datatype is almost always requiring a varying length for its storage (this is why it is often not adequate for referencial keys). But it is still better than using text LOBs for performance reasons (and also for storage requirement reasons, as most LOBs will allocate space with a large granularity of 1 database page for each text/string, most often set to 2KB or 4KB, or because modification of LOBs may be excluded from the "rollbackable" transactions).

    In a pure logical design, all these variants of CHAR, VARCHAR, NCHAR, NVARCHAR, and TEXT LOBs are equivalent. They all designate text, but these multiple RDBMS datatypes are just adding opportunities for specific tuning according to some constraints, or just add a distinction in the handling of he implicit space padding (which you would expect logically when exporting to a fixed file or when comparing keys, but may or may not be physically stored, depending on RDBMS-specific tuning).

    For all these reasons, such tuning of encoding depends mostly of the capability of the JDBC driver. In Java, you will always use Strings internally handled as UTF-16 code units with most APIs, except in serialization of class files which use UTF-8 for String constants before they are loaded and initialized at run-time. JDBC itself does not make this conversion.

    This is the work of the JDBC driver, which can use conversion tools and tables from the RDBMS server or from its native client library, or from the Java toolkit. For performance reasons, each driver has the choice of how it will perform it. So normally you don't have to worry about that in all transactional requests, except if you create a database through JDBC.

    (This should rarely be the case, as there's too many RDBMS specific features to tune up according to an application schema and storage architecture and security, so if you use JDBC, this Java code will use RDBMS-specific requests that will not be portable to other architectures, and should be reserved to building an application for DB-admins, and the choice of the RDBMS will be critical for such application, if you depend a lot on it, as it will limit your deployment possibilities, and future scalability on other shared platforms, or shared application design tools and teams).

    The common JDBC reference will not show you the exact details. In fact, the details of encodings supported and prefereably selected for your JDBC application is specific to each RDBMS. But I really think that the encoding of strings is an internal consideration that JDBC applications should not be exposed to (so I would not like to use an encoding specifier in an URL, as this is not the role of an URL to specify it, but instead the role of the client middleware to negotiate the best encoding supported by both the client application and the remote server, so that the same URL can be used across platforms).

    Because of this consideration, I think that this discussion is no relevant in this newsgroup. This problem is not related to Unicode but to the tuning and portability of JDBC middlewares and applications. I would be better to continue this discussion in the online Sun forums or lists for Java related to JDBC.

    (Unicode is just a minor aspect and non mondatory response to your question, and your real problem is not Unicode, but the choice of the best JDBC drivers or RDBMS engines for your application needs or its deployment; as part of your question you should include the problem of the administration of your database, and the cost and ease with which you'll be ale to guarantee its security and performance, and how you'll scale it. This goes to far for what can be discussed here, and we won't speak further about JDBC and its relevance as a solution for your development and deployment).

    -- Philippe.
    ----- Original Message -----
    From: "souravm" <souravm@infosys.com>

    Hi All,
     
    I've some query on how JDBC is supposed to convert encoding of
    characters while updating/selecting multibyte strings to/from a Database
    supporting Unicode characters. The specific queries are like ...



    This archive was generated by hypermail 2.1.5 : Wed Jun 04 2003 - 10:52:17 EDT