RE: UTF-8 and string manipulations in Java

From: ktadenev@ups.com
Date: Wed Jan 07 2009 - 15:46:56 CST

    Addison,
    Thank you very much, this is very helpful!
    Can you clarify one point:
    You commented: "...a conversion is performed to an external character encoding which may be UTF-8 or some legacy encoding---or even UTF-16, if you so specify."

    Question: Where would you normally specify external character encoding?

    Thank you,

    Konstantin Tadenev
    Database Architect
    Enterprise Information Architecture
    Phone: (201) 828-4076
    mailto:ktadenev@ups.com
    Location: Mahwah (RO3C-123)

    -----Original Message-----
    From: Phillips, Addison [mailto:addison@amazon.com]
    Sent: Wednesday, January 07, 2009 3:36 PM
    To: Tadenev Konstantin (EXT3TWK); unicode@unicode.org
    Subject: RE: UTF-8 and string manipulations in Java

    Hi Konstantin,

    > 1. java.lang.String expects UTF-8 data and any data manipulations appear to a Java programmer as being performed in UTF-8

    This is not correct. java.lang.String is a Unicode string type: a sequence of UTF-16 code units. That is, the internal encoding of String is UTF-16. Some methods (added in Java 1.5) operate on Unicode code points, treating a UTF-16 surrogate pair as a single character.
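    To illustrate the difference between UTF-16 code units and code points, here is a minimal sketch. The class name CodeUnitsDemo and its helper methods are made up for this example; only length(), charAt(), codePointAt() and codePointCount() are actual String methods.

```java
class CodeUnitsDemo {
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so UTF-16
    // needs a surrogate pair (two char values) to represent it.
    static final String CLEF = "\uD834\uDD1E";

    // length() counts UTF-16 code units, not characters.
    static int codeUnits(String s) {
        return s.length();
    }

    // codePointCount() (Java 1.5 and later) counts Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(CLEF));                           // 2
        System.out.println(codePoints(CLEF));                          // 1
        System.out.println(Character.isHighSurrogate(CLEF.charAt(0))); // true
        System.out.println(Integer.toHexString(CLEF.codePointAt(0)));  // 1d11e
    }
}
```

    Note that charAt(0) returns only the first half of the surrogate pair, which is why code that indexes by char can split a supplementary character.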

    All external data consists of bytes. To create a String, a character encoding must be used to convert the bytes to String's internal encoding (which, as mentioned, is UTF-16). The default encoding varies depending on how you access the data. Usually it is best to specify the encoding explicitly, as with InputStreamReader, the String constructor, etc.
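    As a rough sketch of specifying the encoding explicitly when turning bytes into a String, the following uses the String constructor and InputStreamReader. The class DecodeDemo and its method names are hypothetical, and StandardCharsets assumes Java 7 or later; on older releases a charset name such as "UTF-8" can be passed instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // The bytes of "café" in UTF-8: U+00E9 is encoded as 0xC3 0xA9.
    static final byte[] CAFE_UTF8 = { 0x63, 0x61, 0x66, (byte) 0xC3, (byte) 0xA9 };

    // Decode with the String constructor, naming the charset explicitly.
    static String viaConstructor(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Decode a byte stream with InputStreamReader, naming the charset explicitly.
    static String viaReader(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Correct charset: 4 chars, ending in U+00E9.
        System.out.println(viaConstructor(CAFE_UTF8, StandardCharsets.UTF_8).length());      // 4
        // Wrong charset: each byte becomes its own char, so 5 chars of mojibake.
        System.out.println(viaConstructor(CAFE_UTF8, StandardCharsets.ISO_8859_1).length()); // 5
        System.out.println(viaReader(CAFE_UTF8, StandardCharsets.UTF_8).length());           // 4
    }
}
```

    Decoding the same bytes with the wrong charset does not fail; it silently yields different characters, which is the usual argument for never relying on the platform default.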

    Since you are a database architect, you may mean that data in JDBC is UTF-8. The encoding used actually depends on the database driver vendor's implementation, although many drivers (such as Oracle's) do use UTF-8 on the wire. With JDBC, the conversion between the database's internal (native) encoding and String's internal UTF-16 encoding is invisible, and, in fact, not under programmatic control. Accessing a varchar in the database via JDBC is basically transparent: you read it as a String object from the ResultSet.

    Finally, Java has a "UTF-8-like" serialization for String objects (the "modified UTF-8" format), but this is internal to Java and should not be confused with either the encoding used by String or with a valid access method for strings.
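    One place where this UTF-8-like format is visible is DataOutputStream.writeUTF, which writes a two-byte length followed by "modified UTF-8". The sketch below compares its payload size with standard UTF-8. The class ModifiedUtf8Demo and its helpers are invented for illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8Demo {
    // Number of bytes in the standard UTF-8 encoding of s.
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of payload bytes writeUTF() produces, excluding its
    // two-byte length prefix.
    static int writeUtfPayloadLength(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return buf.size() - 2;
    }

    public static void main(String[] args) {
        String nul = "\u0000";        // U+0000
        String clef = "\uD834\uDD1E"; // U+1D11E as a surrogate pair

        // U+0000: one byte in UTF-8, two bytes (0xC0 0x80) in modified UTF-8.
        System.out.println(standardUtf8Length(nul) + " vs " + writeUtfPayloadLength(nul));   // 1 vs 2
        // Supplementary character: four bytes in UTF-8, but each surrogate
        // is written separately as three bytes in modified UTF-8.
        System.out.println(standardUtf8Length(clef) + " vs " + writeUtfPayloadLength(clef)); // 4 vs 6
    }
}
```

    Because the two formats disagree on U+0000 and on supplementary characters, bytes produced by writeUTF are not interchangeable with UTF-8 data meant for other systems.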

    > 2. Internally, when a string manipulation method is invoked (e.g., length(), charAt(int), etc.), Java converts the string content to UTF-16, performs the requested manipulation and converts the content back to UTF-8. None of this is visible to the Java developer

    This is not correct. The string content actually is UTF-16 all the time when in a String object. When you extract bytes from a String, a conversion is performed to an external character encoding which may be UTF-8 or some legacy encoding---or even UTF-16, if you so specify. On some platforms, the default platform encoding is UTF-8, but in other cases, it isn't.
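    A small sketch of the conversion that happens when bytes are extracted from a String: the same String yields different byte sequences depending on the encoding requested. The class EncodeDemo and the method encodedLength are hypothetical names, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDemo {
    // "café": four UTF-16 code units inside the String.
    static final String CAFE = "caf\u00E9";

    // Encode s with the given charset and report how many bytes result.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        System.out.println(encodedLength(CAFE, StandardCharsets.UTF_8));      // 5
        System.out.println(encodedLength(CAFE, StandardCharsets.ISO_8859_1)); // 4
        System.out.println(encodedLength(CAFE, StandardCharsets.UTF_16BE));   // 8

        // getBytes() with no argument uses the platform default, which
        // differs between systems, so its result is not printed as a fixed value.
        System.out.println(Charset.defaultCharset());
    }
}
```

    The String itself is unchanged by any of these calls; only the byte arrays that are produced differ.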

    Hope that helps,

    Addison

    Addison Phillips
    Globalization Architect -- Lab126

    Internationalization is not a feature.
    It is an architecture.

    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of ktadenev@ups.com
    Sent: Wednesday, January 07, 2009 7:42 AM
    To: unicode@unicode.org
    Subject: UTF-8 and string manipulations in Java

    Hello,
    I have a question on Java internal data manipulations as they pertain to UTF-8 strings.

    Are these statements correct?

    1. java.lang.String expects UTF-8 data and any data manipulations appear to a Java programmer as being performed in UTF-8
    2. Internally, when a string manipulation method is invoked (e.g., length(), charAt(int), etc.), Java converts the string content to UTF-16, performs the requested manipulation and converts the content back to UTF-8. None of this is visible to the Java developer

    I would appreciate any insight...

    Thank you,

    Konstantin Tadenev
    UPS
    Database Architect
    Enterprise Information Architecture
    Location RO3C-123
    340 McArthur Blvd
    Mahwah, NJ 07430
    Phone: (201) 828-4076
    mailto:ktadenev@ups.com


