Re: Java UTF-8 String

From: Glen Perkins (gperkins@netcom.com)
Date: Thu Jul 03 1997 - 16:30:23 EDT


Amendment:

Siddhant Kaul <kaulsn@acs.wooster.edu> wrote:
>
> Is there any way to construct a string using utf8 characters? I tried it
> using the following code:
>
> public static byte[] convert(byte[] inBytes, String inEnc, String outEnc)
>         throws UnsupportedEncodingException
> {
>     return new String(inBytes, inEnc).getBytes(outEnc);
> }
> and then constructing the string using the String(byte b[])
> constructor. It didn't seem to work, returning a null string

It just occurred to me that there is more than one interpretation of your
question. I thought you were asking whether it was possible to force Java
to change the encoding it uses internally for Strings, and it appears
that is what you were trying to do. As I said, it is not possible to do
so, which is a Good Thing. How Java represents its String data internally
is its business.

Another possible interpretation of your question, though, is simply, "is
it possible to create a Java String from a byte array if that byte array
represents a series of characters encoded in UTF-8?"

Yes, it certainly is, and it's one of the nicest features of Java for
i18n fans. You just do this:

String myString = new String(myUTF8ByteArray, "UTF8");

That would take your UTF-8 bytes and turn them into a valid Java String
containing the chars you expected.
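
To make that concrete, here is a minimal sketch of the decoding step. The
class name Utf8Decode and the helper fromUtf8 are made up for illustration,
and the sample bytes are my own; only the String constructor call comes
from the line above.

```java
import java.io.UnsupportedEncodingException;

class Utf8Decode {
    // Hypothetical helper: decode bytes that are known to be UTF-8.
    static String fromUtf8(byte[] utf8Bytes) throws UnsupportedEncodingException {
        return new String(utf8Bytes, "UTF8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // "caf" followed by e-acute (U+00E9), which is 0xC3 0xA9 in UTF-8.
        byte[] myUTF8ByteArray = { 0x63, 0x61, 0x66, (byte) 0xC3, (byte) 0xA9 };
        String myString = fromUtf8(myUTF8ByteArray);
        // Five bytes decode to four chars, because the last two bytes
        // together encode a single character.
        System.out.println(myString.length()); // prints 4
    }
}
```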

BTW, in the code you were trying to use, you were using your "convert"
method to convert a byte array in one encoding into a byte array in
another encoding. You were then telling Java to turn the resulting byte
array into a String, but you weren't telling the String constructor what
the encoding of that byte array was. You were using the String
constructor that instructs Java to assume that the byte array is a
series of characters encoded in your user's system's *default encoding*,
whatever that is. It would be different from user to user, succeeding on
some machines and failing on others.

If outEnc were "UTF8", but your user were using US Win95 (i.e., default
encoding were ISO 8859-1), for example, you'd be telling the String
constructor to interpret the bytes in a UTF-8 stream as if they were a
stream of ISO 8859-1. Naturally, that would fail: no exception is thrown,
but every non-ASCII character comes out as garbage.
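
A small sketch of that mismatch, decoding the same bytes two ways. The
class EncodingMismatch and the helper decodeAs are hypothetical names of
mine, and I name ISO 8859-1 explicitly here to stand in for the default
encoding of the machine described above.

```java
import java.io.UnsupportedEncodingException;

class EncodingMismatch {
    // Hypothetical helper: decode bytes using an explicitly named encoding.
    static String decodeAs(byte[] bytes, String encoding)
            throws UnsupportedEncodingException {
        return new String(bytes, encoding);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // e-acute (U+00E9) encoded as UTF-8: the two bytes 0xC3 0xA9.
        byte[] utf8 = "\u00e9".getBytes("UTF8");

        String right = decodeAs(utf8, "UTF8");       // one char, U+00E9
        String wrong = decodeAs(utf8, "ISO-8859-1"); // two chars, U+00C3 U+00A9

        System.out.println(right.length()); // prints 1
        System.out.println(wrong.length()); // prints 2
    }
}
```

Note that the wrong decoding does not throw anything. Each byte simply
becomes its own character, which is why this kind of bug tends to show up
as garbled text rather than as an error.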

If your code is going to deal with multiple encodings, always decide
carefully whether your intention is to use a specific, fixed encoding,
or whether you want to use the user's default encoding. If the former,
use the constructors that specify the encoding as an explicit argument.
Don't use the constructors that leave out the encoding argument without
thinking things through. There is *always* an encoding, whether you
specify it or not. If you don't specify it, Java will assume that you
want to use the default encoding on the user's machine, which is
platform-specific, i.e. potentially different for every user. Sometimes
that's what you want, but often it is not.
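
Applied to the code in the question, the rule might look like the sketch
below. The convert method is the one from the original message; the class
name ExplicitEncoding and the helper convertToString are hypothetical, and
the sample input is my own.

```java
import java.io.UnsupportedEncodingException;

class ExplicitEncoding {
    // The convert method from the question: bytes in inEnc to bytes in outEnc.
    static byte[] convert(byte[] inBytes, String inEnc, String outEnc)
            throws UnsupportedEncodingException {
        return new String(inBytes, inEnc).getBytes(outEnc);
    }

    // Hypothetical helper: build the String by naming the encoding that
    // convert() produced, instead of relying on the platform default.
    static String convertToString(byte[] inBytes, String inEnc, String outEnc)
            throws UnsupportedEncodingException {
        byte[] outBytes = convert(inBytes, inEnc, outEnc);
        return new String(outBytes, outEnc); // not new String(outBytes)
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] latin1 = { (byte) 0xE9 }; // e-acute in ISO 8859-1
        String s = convertToString(latin1, "ISO-8859-1", "UTF8");
        System.out.println(s.equals("\u00e9")); // prints true
    }
}
```

If all you want is the String, the detour through outEnc is redundant:
new String(inBytes, inEnc) gives the same result. The helper only shows
that whichever encoding produced the bytes is the one the constructor has
to be told about.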

__Glen Perkins__
glen.perkins@nativeguide.com


