RE: UTF-8 and string manipulations in Java

From: Phillips, Addison
Date: Sun Jan 11 2009 - 18:29:58 CST

    > > Finally, Java has a "UTF-8-like" serialization for String
    > > objects that is based on UTF-8, but this is internal to Java
    > > and should not be confused with either the encoding used by
    > > String or with a valid access method for strings.
    > It is NOT invisible to Java programmers given that this specific
    > encoding is exposed in the API driving the format of compiled java
    > classes (see the JVM specification) // .... etc.....

    I think doing more than mentioning this would be massively confusing. Most Java programmers will never encounter the serialization scheme, since it requires that you access bytes You Are Not Meant To Touch. Yes, C programmers will encounter "modified UTF-8" (which is the point of using this weird encoding: a Java String containing a 'null', U+0000, can become a C string with a null byte only at the end), but Java programmers will rarely encounter it if they write clean code. They see String objects.
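
    For anyone who wants to see the difference in bytes, here is a minimal sketch. It assumes DataOutputStream.writeUTF as the way to reach modified UTF-8 from plain Java (it is documented to write that format, preceded by a 2-byte length). The class and helper names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
final class ModifiedUtf8Null {

    // Standard UTF-8, as produced by the ordinary String API.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by DataOutputStream.writeUTF,
    // with the leading 2-byte length field stripped off.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            byte[] framed = buf.toByteArray();
            return Arrays.copyOfRange(framed, 2, framed.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\0b"; // 'a', U+0000, 'b'
        System.out.println(hex(standardUtf8(s))); // 61 00 62
        System.out.println(hex(modifiedUtf8(s))); // 61 C0 80 62
    }
}
```

    The standard encoding contains a zero byte in the middle. The modified form writes U+0000 as the two bytes C0 80, so C code can treat the result as a null-terminated string.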

    After all, you go on to mention:

    > The modified UTF-8 is just used there as a serialization of the
    > UTF-16 internal storage (exposed in the String methods) onto a
    > stream of bytes for use strictly with Java, it is not meant for
    > interchange, except within the transport of precompiled Java
    > classes with the usual Java class format.

    Exactly so.
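
    To illustrate the "serialization of the UTF-16 internal storage" point, here is a rough sketch, again assuming DataOutputStream.writeUTF and DataInputStream.readUTF as the accessible route to this format. The class and method names are invented for the example. A supplementary character is written surrogate by surrogate (3 bytes each) rather than as one 4-byte UTF-8 sequence, yet the String read back is identical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
final class ModifiedUtf8Surrogates {

    // Bytes written by writeUTF, including the 2-byte length prefix.
    static byte[] writeUtf(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads back a String from bytes produced by writeUtf.
    static String readUtf(byte[] framed) {
        try {
            return new DataInputStream(new ByteArrayInputStream(framed)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is stored in a String as the surrogate pair D83D DE00.
        String s = new String(Character.toChars(0x1F600));

        byte[] standard = s.getBytes(StandardCharsets.UTF_8);
        byte[] framed = writeUtf(s);
        byte[] modified = Arrays.copyOfRange(framed, 2, framed.length);

        System.out.println(hex(standard)); // F0 9F 98 80
        System.out.println(hex(modified)); // ED A0 BD ED B8 80
        System.out.println(readUtf(framed).equals(s)); // true
    }
}
```

    The bytes differ from standard UTF-8, but code that stays within String never notices.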

    I didn't say that the serialization scheme didn't exist. In fact, I went out of my way to point it out for those pedantic enough to require me to note it :-)*, and as a starting point in case Konstantin wrote back to say that he was having a problem with, say, his JNI calls. In practice, the vast preponderance of Java programmers will never even know that this encoding exists except via emails such as this.


    Addison Phillips
    Globalization Architect -- Lab126

    Internationalization is not a feature.
    It is an architecture.

    * No emoji intended here ;-).
