Re: Encoding conversion through JDBC

From: Mark Davis (mark.davis@jtcsv.com)
Date: Wed Jun 04 2003 - 13:14:08 EDT

    A few items:

    I agree with your main point, which is that UCS-2 is, for all
    practical purposes, just a repertoire subset of UTF-16; the code units
    and bit-width are the same.
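
    As a rough illustration of that point (a minimal sketch, not part of
    the original discussion; the class and method names are hypothetical),
    the standard Java library shows that BMP text uses identical 16-bit
    code units either way, and that only supplementary characters, which
    need a surrogate pair, fall outside the UCS-2 repertoire:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class CodeUnitDemo {
    // True if no code unit is a surrogate, i.e. the text stays inside
    // the repertoire that UCS-2 can represent.
    static boolean fitsUcs2(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Size of the text when serialized as big-endian UTF-16 without a BOM.
    static int utf16ByteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String bmp = "\u00E9";          // U+00E9, one 16-bit code unit
        String supp = "\uD800\uDF48";   // U+10348, a surrogate pair
        System.out.println(fitsUcs2(bmp) + " " + utf16ByteLength(bmp));
        System.out.println(fitsUcs2(supp) + " " + utf16ByteLength(supp));
    }
}
```

    The code unit width is 16 bits in both cases; the only difference is
    whether surrogate code units are interpreted as pairs.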

    > Some Java classes that assume that the "char" arithmetic will
    > automatically roll after 16 bits are wrong. The JVM spec only requires
    > that char be at least 16-bit wide (but it may be larger). The compiled
    > classes need to store string constants. But these constants are
    > serialized to be platform independent using a UTF-8 encoding scheme.

    I'm in the JSR 204 group looking at supplementary character support.
    Although I won't speak to the details of the discussions in that
    group, it is quite unlikely that char would be changed to be 32 bits.
    It would break far too much.
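
    To make the compatibility concern concrete, here is a small sketch
    (hypothetical class and method names, not taken from any JSR 204
    material). It relies on the Java Language Specification's definition
    of char as a 16-bit unsigned integer type, which existing code
    depends on:

```java
// Hypothetical helper class, for illustration only.
final class CharWidthDemo {
    // Incrementing a char promotes to int; the cast narrows the result
    // back to 16 bits, so 0xFFFF + 1 wraps around to 0.
    static char next(char c) {
        return (char) (c + 1);
    }

    // A code point above U+FFFF does not fit in a char; the narrowing
    // cast silently keeps only the low 16 bits.
    static char truncate(int codePoint) {
        return (char) codePoint;
    }

    public static void main(String[] args) {
        System.out.println((int) next('\uFFFF'));
        System.out.println(Integer.toHexString(truncate(0x10348)));
    }
}
```

    Code written against this behavior would change meaning if char were
    widened, which is one way to read "it would break far too much".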

    > The probable official full support of Unicode 4 and 3.2 will come
    > with new classes derived from Character and String (UChar and UString
    > are their names in the IBM ICU package, but Sun may also keep the class
    > name but designate them under the java.text package instead of the
    > core's java.lang package, and a compiler option (such as the target
    > Java version) may allow a class author to compile its code according
    > to the default java.lang.String or java.text.String class if the
    > package name is not specified by an explicit import).

    In ICU4J (which is an add-on package for Java), we don't have classes
    UChar and UString. For supplementary support, we have:

    - UCharacter, which provides property functions based on code
    points rather than chars. (It also has all the UCD properties
    instead of just the small fraction that are in the standard JDK.)

    - UTF16, which provides utilities for using supplementaries with
    String, StringBuffer and char[]
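
    The ICU4J signatures are not reproduced here. As a stand-in, the
    sketch below uses the code point methods that later shipped in
    java.lang.Character and java.lang.String (Java 5 and up) to show the
    same idea: walk a String by code point rather than by char, and query
    properties on the code point. The class and method names are
    hypothetical.

```java
// Hypothetical helper class, for illustration only. Uses standard JDK
// methods, not the ICU4J classes named above.
final class CodePointDemo {
    // Number of code points, which can be smaller than String.length()
    // when the text contains surrogate pairs.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Count letters by code point, so a supplementary letter stored as
    // a surrogate pair is counted once instead of being missed.
    static int countLetters(String s) {
        int letters = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                letters++;
            }
            i += Character.charCount(cp); // advance 1 or 2 chars
        }
        return letters;
    }

    public static void main(String[] args) {
        String text = "a\uD800\uDF48"; // 'a' followed by U+10348
        System.out.println(text.length());
        System.out.println(countCodePoints(text));
        System.out.println(countLetters(text));
    }
}
```

    A char-by-char loop calling Character.isLetter(char) would see the
    two surrogates separately and count only one letter in that string.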

    The other functionality, such as Normalizer, UnicodeSet, Collator,
    StringSearch, Transliterator, etc. all handle supplementary
    characters.

    See http://oss.software.ibm.com/icu4j/doc/index.html for details.

    BTW, I only very quickly scan long documents, such as those that you
    and a few others are blessed with the ability to produce. So there may
    be other items that I don't catch.

    Mark

    > -- Philippe.
    > ----- Original Message -----
    > From: "Michael (michka) Kaplan" <michka@trigeminal.com>
    > To: "Philippe Verdy" <verdy_p@wanadoo.fr>
    > Sent: Wednesday, June 04, 2003 4:36 PM
    > Subject: Re: Encoding conversion through JDBC
    >
    >
    > > From: "Philippe Verdy" <verdy_p@wanadoo.fr>
    > >
    > > Philippe, you went on for quite a while and I admit most of the things you
    > > talked about are not things about which I have knowledge. But some of the
    > > things you talked about, I do understand, and in those cases you were wrong.
    > > Psychologically, it causes me to wonder how much of the rest of this message
    > > conveys accurate information.
    > >
    > > Specifically, you talk about SQL Server but most of what you said about it
    > > is inaccurate. You cannot store big endian data without risking corruption,
    > > you can only store UCS-2, it is not surrogate aware and can thus be said to
    > > truly support only UCS-2, not UTF-16, and the "N" prefix fields *always*
    > > mean UCS-2 for MSSQLS, period.
    > >
    > > You have a gift -- that of being able to speak knowledgeably. But please, use
    > > that gift for *good* and do not move past what you know.
    > >
    > > Please, think about it?
    > >
    > > MichKa
    > >
    >
    >