Re: ASCII escapes for Unicode

From: Andrew Miller (A.J.Miller@bcs.org.uk)
Date: Thu Apr 23 2009 - 06:40:45 CDT

Next message: Andreas Stötzner: "Encoding of symbols"

Previous message: verdy_p: "re: ASCII escapes for Unicode"
Maybe in reply to: William J Poser: "ASCII escapes for Unicode"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

You need to use surrogate pairs to enter non-BMP characters. The database
that I tested was using the UTF8 character set (which is actually CESU-8 -
AL32UTF8 is the correct UTF-8 character set in Oracle).

VARCHAR2 columns were storing the non-BMP characters as CESU-8, while
NVARCHAR2 columns were using UTF-16LE format (you can use the DUMP command
to see the raw data in a field). Presumably this would be different if
using AL32UTF8 or one of the other Unicode character sets.

2009/4/23 verdy_p <verdy_p@wanadoo.fr>

> "Andrew Miller" wrote:
> > Oracle uses \xxxx within the UNISTR function (see
> >
> http://download.oracle.com/docs/cd/B25329_01/doc/appdev.102/b25108/xedev_global.htm#i1006315
> > ).
>
> Not just with this function. The syntax is valid for any string literal. It
> just means that the syntax is used to
> represent UTF-16 code units as hexadecimal in the literal syntax. That's
> because literals are actually of type CHAR
> and encoded within the client's national set, and not necessarily the
> database's national set. The interest of this
> cunction is that the SQL query can be created and transported in this
> syntax including through proxying servers
> without change.
>
> Oracle automatically converts the literal to the internal Unicode set. What
> UNISTR performs is to convert any
> Unicode string value into the national set in use within the database (for
> the CHAR datatype). Its parameter can be
> given as any NLSCHAR value, including literals that are already coded in
> the SQL source syntax with their actual
> codepoints (or code positions in the client national set). If the national
> set in use within the database is
> already a Unicode UTF, the function performs no conversion, and can cause
> no loss or error, the parameter is simply
> used as is.
>
> Be careful when working with Oracle databases: you have possibly up to 3
> simultaneous charsets in use: one in your
> application that "speaks" to the client SQL library (generally the one your
> applkcation uses for its GUI or data
> exports and reports), one that is used for the session, one that is used
> internally within the database for the
> storage. The conversions that occur between each layer are quite complex
> and can be performed on the server and/or
> the client (in addition, with proxied remote databases, the Oracle engine
> may "speak" with other engines, using its
> own specific settings for the sessions between them.)
>
> There are several ways that the server or client can handle the conversion
> errors that occur when there's no
> roundtrip compatibility: fallbacks or errors. These are set through session
> parameters or with SQL client library
> APIs to set the client context, and they depend on the capability of the
> client host to perform the conversions
> itself (who makes the actual conversions is determined by negotiation when
> setting up the session).
>
> The Oracle documentation is quite extensive about NLS support (several
> books)...
>
> However I don't know now if you can represent non-BMP characters with this
> syntax (do you need to use pairs of
> surrogates? Or is there a simpler syntax using also '\' followed by some
> distinctive code like 'U' before the 32-
> bit hex value? My use of Oracle has never been in this area so far, because
> I never needed Oracle to store non-BMP
> characters). I'll check this when I'll have access to the books (my locally
> installed databases are not using
> Unicode as its national set).
>
>
>

Next message: Andreas Stötzner: "Encoding of symbols"
Previous message: verdy_p: "re: ASCII escapes for Unicode"
Maybe in reply to: William J Poser: "ASCII escapes for Unicode"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Thu Apr 23 2009 - 06:46:10 CDT