RE: UTF-8 and string manipulations in Java

From: Phillips, Addison (addison@amazon.com)
Date: Wed Jan 07 2009 - 15:53:20 CST


    In:

     String.getBytes()
     OutputStreamWriter (and InputStreamReader) constructors
     Charset (and friends)
     (and so on)

    If you don't specify an encoding, the platform's default encoding is used.
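As an illustration of the places listed above, here is a minimal sketch that names the encoding explicitly when writing and reading text. The class name ExplicitEncodingSketch and its helper methods are hypothetical, and the code assumes Java 8 or later (StandardCharsets, try-with-resources, and UncheckedIOException did not exist when this message was written).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class ExplicitEncodingSketch {
    // Encode text to bytes through a Writer, naming the encoding explicitly.
    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decode bytes back to a String through a Reader, again naming the encoding.
    static String read(byte[] bytes, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "na\u00efve"; // contains U+00EF, LATIN SMALL LETTER I WITH DIAERESIS
        byte[] utf8 = write(text, StandardCharsets.UTF_8);
        byte[] latin1 = write(text, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length);   // 6: U+00EF takes two bytes in UTF-8
        System.out.println(latin1.length); // 5: one byte per character in ISO-8859-1
        System.out.println(read(utf8, StandardCharsets.UTF_8).equals(text)); // true
        // The value used when no encoding is specified; it varies by platform.
        System.out.println(Charset.defaultCharset());
    }
}
```

Passing a Charset object rather than an encoding name avoids the checked UnsupportedEncodingException that the name-based overloads declare.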

    Regards,

    Addison

    Addison Phillips
    Globalization Architect -- Lab126

    Internationalization is not a feature.
    It is an architecture.

    > -----Original Message-----
    > From: ktadenev@ups.com [mailto:ktadenev@ups.com]
    > Sent: Wednesday, January 07, 2009 1:47 PM
    > To: Phillips, Addison; unicode@unicode.org
    > Subject: RE: UTF-8 and string manipulations in Java
    >
    > Addison,
    > Thank you very much, this is very helpful!
    > Can you clarify one point:
    > You commented: "...a conversion is performed to an external
    > character encoding which may be UTF-8 or some legacy encoding---or
    > even UTF-16, if you so specify."
    >
    > Question: Where would you normally specify external character
    > encoding?
    >
    > Thank you,
    >
    >
    > Konstantin Tadenev
    > Database Architect
    > Enterprise Information Architecture
    > Phone: (201) 828-4076
    > mailto:ktadenev@ups.com
    > Location: Mahwah (RO3C-123)
    >
    >
    > -----Original Message-----
    > From: Phillips, Addison [mailto:addison@amazon.com]
    > Sent: Wednesday, January 07, 2009 3:36 PM
    > To: Tadenev Konstantin (EXT3TWK); unicode@unicode.org
    > Subject: RE: UTF-8 and string manipulations in Java
    >
    > Hi Konstantin,
    >
    > > 1. java.lang.String expects UTF-8 data and any data manipulations
    > > appear to a Java programmer as being performed in UTF-8
    >
    > This is not correct. java.lang.String is a Unicode string type: an
    > array of UTF-16 code units. That is, the internal encoding of String
    > is UTF-16. Some methods exist (since 1.5) for manipulating Unicode
    > code points (i.e., a UTF-16 surrogate pair is treated as a single
    > character).
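The difference between UTF-16 code units and Unicode code points can be seen in a small sketch like the following. The class name CodePointSketch and its two helper methods are hypothetical; codePointCount, codePointAt, and Character.isHighSurrogate are the standard methods added in Java 1.5.

```java
// Hypothetical helper class, for illustration only.
class CodePointSketch {
    // Number of UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (a surrogate pair counts once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF, written as its UTF-16 surrogate pair.
        String clef = "\uD834\uDD1E";
        System.out.println(codeUnits(clef));                           // 2
        System.out.println(codePoints(clef));                          // 1
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
        System.out.println(Integer.toHexString(clef.codePointAt(0)));  // 1d11e
    }
}
```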
    >
    > All external data consists of bytes. To create a String, a
    > character encoding must be used to convert the bytes to String's
    > internal encoding (which, as mentioned, is UTF-16). Depending on
    > how you access the data, various character encodings may be the
    > default value. Usually it is best to specify the encoding, as with
    > InputStreamReader, the String ctor, etc.
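To illustrate why specifying the encoding matters when creating a String from bytes, this sketch decodes the same two bytes with two different encodings. The class name DecodeSketch and its decode helper are hypothetical, and StandardCharsets assumes Java 7 or later; the String(byte[], Charset) constructor itself is standard.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class DecodeSketch {
    // Convert external bytes to a String using an explicitly named encoding.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of U+00E9, LATIN SMALL LETTER E WITH ACUTE.
        byte[] utf8Bytes = {(byte) 0xC3, (byte) 0xA9};

        String right = decode(utf8Bytes, StandardCharsets.UTF_8);
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(right.length());                // 1
        System.out.println(right.equals("\u00E9"));        // true
        System.out.println(wrong.length());                // 2
        // Each byte was read as its own Latin-1 character: U+00C3 then U+00A9.
        System.out.println(wrong.equals("\u00C3\u00A9"));  // true
    }
}
```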
    >
    > Since you are a database architect, you may mean that data in JDBC
    > is UTF-8. The encoding used actually depends on the database driver
    > vendor's implementation, although many drivers (such as Oracle's)
    > do use UTF-8 on the wire. With JDBC, the conversion between the
    > database's internal (native) encoding and String's internal UTF-16
    > encoding is invisible, and, in fact, not under programmatic control.
    > Accessing a varchar in the database via JDBC is basically
    > transparent: you read it as a String object from the ResultSet.
    >
    > Finally, Java has a "UTF-8-like" serialization for String objects
    > (a modified form of UTF-8), but this is internal to Java and should
    > not be confused with the encoding used by String, nor treated as a
    > valid access method for strings.
    >
    > > 2. Internally, when a string manipulation method is invoked (e.g.,
    > > length(), charAt(int), etc.), Java converts the string content to
    > > UTF-16, performs the requested manipulation and converts the
    > > content back to UTF-8. None of this is visible to the Java
    > > developer
    >
    > This is not correct. The string content actually is UTF-16 all the
    > time when in a String object. When you extract bytes from a String,
    > a conversion is performed to an external character encoding which
    > may be UTF-8 or some legacy encoding---or even UTF-16, if you so
    > specify. On some platforms, the default platform encoding is UTF-8,
    > but in other cases, it isn't.
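The conversion performed when extracting bytes can be sketched as follows: the same one-character String yields different byte sequences depending on the encoding requested. The class name GetBytesSketch and its helper methods are hypothetical, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class GetBytesSketch {
    // Extract bytes from a String in an explicitly named encoding.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    static int encodedLength(String s, Charset cs) {
        return encode(s, cs).length;
    }

    public static void main(String[] args) {
        String euro = "\u20AC"; // U+20AC EURO SIGN: one UTF-16 code unit in the String

        System.out.println(euro.length());                                  // 1
        System.out.println(encodedLength(euro, StandardCharsets.UTF_8));    // 3
        System.out.println(encodedLength(euro, StandardCharsets.UTF_16BE)); // 2

        // ISO-8859-1 has no euro sign, so getBytes substitutes '?' (0x3F).
        byte[] latin1 = encode(euro, StandardCharsets.ISO_8859_1);
        System.out.println(latin1.length);                                  // 1
        System.out.println(latin1[0] == (byte) 0x3F);                       // true

        // With no argument, getBytes() uses the platform default, which varies.
        System.out.println(Charset.defaultCharset());
    }
}
```

The ISO-8859-1 case shows that the conversion can lose data silently when the target encoding cannot represent a character.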
    >
    > Hope that helps,
    >
    > Addison
    >
    > Addison Phillips
    > Globalization Architect -- Lab126
    >
    > Internationalization is not a feature.
    > It is an architecture.
    >
    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
    > On Behalf Of ktadenev@ups.com
    > Sent: Wednesday, January 07, 2009 7:42 AM
    > To: unicode@unicode.org
    > Subject: UTF-8 and string manipulations in Java
    >
    > Hello,
    > I have a question on Java internal data manipulations as they
    > pertain to UTF-8 strings.
    >
    > Are these statements correct?
    >
    > 1. java.lang.String expects UTF-8 data and any data manipulations
    > appear to a Java programmer as being performed in UTF-8
    > 2. Internally, when a string manipulation method is invoked (e.g.,
    > length(), charAt(int), etc.), Java converts the string content to
    > UTF-16, performs the requested manipulation and converts the
    > content back to UTF-8. None of this is visible to the Java
    > developer
    >
    > I would appreciate any insight...
    >
    > Thank you,
    >
    > Konstantin Tadenev
    > UPS
    > Database Architect
    > Enterprise Information Architecture
    > Location RO3C-123
    > 340 McArthur Blvd
    > Mahwah, NJ 07430
    > Phone: (201) 828-4076
    > mailto:ktadenev@ups.com
    >
    >
    >


