Re: ASCII escapes for Unicode

From: Andrew Miller (A.J.Miller@bcs.org.uk)
Date: Thu Apr 23 2009 - 06:40:45 CDT

  • Next message: Andreas Stötzner: "Encoding of symbols"

    You need to use surrogate pairs to enter non-BMP characters. The database
    that I tested was using the UTF8 character set (which is actually CESU-8 -
    AL32UTF8 is the correct UTF-8 character set in Oracle).

    VARCHAR2 columns were storing the non-BMP characters as CESU-8, while
    NVARCHAR2 columns were using UTF-16LE format (you can use the DUMP command
    to see the raw data in a field). Presumably this would be different if
    using AL32UTF8 or one of the other Unicode character sets.

    2009/4/23 verdy_p <verdy_p@wanadoo.fr>

    > "Andrew Miller" wrote:
    > > Oracle uses \xxxx within the UNISTR function (see
    > >
    > http://download.oracle.com/docs/cd/B25329_01/doc/appdev.102/b25108/xedev_global.htm#i1006315
    > > ).
    >
    > Not just with this function. The syntax is valid for any string literal. It
    > just means that the syntax is used to
    > represent UTF-16 code units as hexadecimal in the literal syntax. That's
    > because literals are actually of type CHAR
    > and encoded within the client's national set, and not necessarily the
    > database's national set. The interest of this
    > cunction is that the SQL query can be created and transported in this
    > syntax including through proxying servers
    > without change.
    >
    > Oracle automatically converts the literal to the internal Unicode set. What
    > UNISTR performs is to convert any
    > Unicode string value into the national set in use within the database (for
    > the CHAR datatype). Its parameter can be
    > given as any NLSCHAR value, including literals that are already coded in
    > the SQL source syntax with their actual
    > codepoints (or code positions in the client national set). If the national
    > set in use within the database is
    > already a Unicode UTF, the function performs no conversion, and can cause
    > no loss or error, the parameter is simply
    > used as is.
    >
    > Be careful when working with Oracle databases: you have possibly up to 3
    > simultaneous charsets in use: one in your
    > application that "speaks" to the client SQL library (generally the one your
    > applkcation uses for its GUI or data
    > exports and reports), one that is used for the session, one that is used
    > internally within the database for the
    > storage. The conversions that occur between each layer are quite complex
    > and can be performed on the server and/or
    > the client (in addition, with proxied remote databases, the Oracle engine
    > may "speak" with other engines, using its
    > own specific settings for the sessions between them.)
    >
    > There are several ways that the server or client can handle the conversion
    > errors that occur when there's no
    > roundtrip compatibility: fallbacks or errors. These are set through session
    > parameters or with SQL client library
    > APIs to set the client context, and they depend on the capability of the
    > client host to perform the conversions
    > itself (who makes the actual conversions is determined by negotiation when
    > setting up the session).
    >
    > The Oracle documentation is quite extensive about NLS support (several
    > books)...
    >
    > However I don't know now if you can represent non-BMP characters with this
    > syntax (do you need to use pairs of
    > surrogates? Or is there a simpler syntax using also '\' followed by some
    > distinctive code like 'U' before the 32-
    > bit hex value? My use of Oracle has never been in this area so far, because
    > I never needed Oracle to store non-BMP
    > characters). I'll check this when I'll have access to the books (my locally
    > installed databases are not using
    > Unicode as its national set).
    >
    >
    >



    This archive was generated by hypermail 2.1.5 : Thu Apr 23 2009 - 06:46:10 CDT