Re: UTF-8 and string manipulations in Java

From: Johannes Rössel (joey@muhkuhsaft.de)
Date: Wed Jan 07 2009 - 11:10:57 CST

    Hello,

    First, this question probably belongs on a Java mailing list, not a
    Unicode one, as it deals with Java specifics, not Unicode per se.

    > 1. java.lang.String expects UTF-8 data and any data manipulations
    > appear to a Java programmer as being performed in UTF-8
    >

    To cite the Java Language Specification, Third Edition (p. 48): “The
    Java programming language represents text in sequences of 16-bit code
    units, using the UTF-16 encoding. A few APIs, primarily in the Character
    class, use 32-bit integers to represent code points as individual
    entities. The Java platform provides methods to convert between the two
    representations.”

    No reference to UTF-8 is made anywhere within the specification.
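
    As a small sketch of the two representations the quote mentions, the
    following uses Character.toChars and String.codePointAt to go from a
    32-bit code point to 16-bit code units and back. The class name and
    the helper fromCodePoint are made up for this example; the code
    points are arbitrary choices.

```java
class CodePointDemo {
    // Hypothetical helper: encodes one code point as UTF-16 code units
    // (one char inside the BMP, two chars outside it).
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x00E4);     // U+00E4, inside the BMP
        String astral = fromCodePoint(0x1D11E); // U+1D11E, outside the BMP

        System.out.println(bmp.length());       // 1 code unit
        System.out.println(astral.length());    // 2 code units (a surrogate pair)

        // codePointAt reads the pair starting at index 0 back as one code point.
        System.out.println(Integer.toHexString(astral.codePointAt(0))); // 1d11e
    }
}
```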

    The prevalent encoding for source files that use Unicode directly is
    probably UTF-8, but in that case the compiler converts the string
    literals to UTF-16.

    I am not sure what exactly you mean by “appear to a Java programmer as
    being performed in UTF-8”. String processing is always done in terms
    of chars (UTF-16 code units), whether on the whole string or on
    substrings. Nothing depends on the bytes of some encoded form of the
    string, if that's what you mean. Within strings you may have to deal
    with high or low surrogate code units (U+D800–U+DFFF), though (not
    sure, since I never tried).
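
    A minimal sketch of how surrogates show up in practice, using only
    java.lang.String and java.lang.Character methods. The class name, the
    helper codePoints and the sample string are assumptions made for
    illustration.

```java
class SurrogateDemo {
    // Hypothetical helper: counts code points rather than UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a", then U+1D11E written as its surrogate pair, then "b".
        String s = "a\uD834\uDD1Eb";

        System.out.println(s.length());     // 4 code units
        System.out.println(codePoints(s));  // 3 code points

        // charAt returns individual code units, so the pair is visible.
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(2)));  // true

        // substring takes code unit indices, so it can cut a pair in half.
        String cut = s.substring(0, 2);
        System.out.println(Character.isHighSurrogate(cut.charAt(1))); // true
    }
}
```

    So length() and charAt(int) count and return code units, and code
    that indexes into a string has to take care not to split a pair.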

    > 2. Internally, when a string manipulation method is invoked
    > (e.g., length(), charAt(int), etc.), Java converts the string content
    > to UTF-16, performs the requested manipulation and converts the
    > content back to UTF-8. None of this is visible to the Java developer
    >

    Not that I know of. A Java implementation that behaved like this
    would probably violate the part of the specification quoted above.
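
    For illustration, UTF-8 only comes into play when a string is
    explicitly encoded to or decoded from bytes. The sketch below assumes
    a Java version that has java.nio.charset.StandardCharsets (Java 7 or
    later); the class name and the helpers toUtf8 and fromUtf8 are made
    up for this example.

```java
import java.nio.charset.StandardCharsets;

class Utf8BoundaryDemo {
    // Hypothetical helper: explicit conversion from UTF-16 chars to UTF-8 bytes.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Hypothetical helper: explicit conversion from UTF-8 bytes back to a String.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "R\u00F6ssel";    // U+00F6 is one char in a String

        System.out.println(name.length());          // 6 code units
        System.out.println(toUtf8(name).length);    // 7 bytes, U+00F6 takes two

        // The round trip through UTF-8 yields an equal string.
        System.out.println(fromUtf8(toUtf8(name)).equals(name)); // true
    }
}
```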

    Regards,
    Johannes


