Re: Java UTF-8 String

From: Glen Perkins (gperkins@netcom.com)
Date: Thu Jul 03 1997 - 13:39:58 EDT


Siddhant Kaul <kaulsn@acs.wooster.edu> wrote:
 
> Is there any way to construct a string using utf8 characters? I tried it
> using the following code:
>
> public static byte[] convert(byte[] inBytes, String inEnc, String outEnc)
>         throws UnsupportedEncodingException
> {
>     return new String(inBytes, inEnc).getBytes(outEnc);
> }
> and then constructing the string using the String(byte b[])
> constructor. It didn't seem to work, returning a null string.
> Siddhant Kaul
> Student researcher,
> The college of Wooster

A Java "String" is a series of Java "char"s. It's not a series of bytes,
as far as we programmers are concerned. You don't get to choose the
internal encoding of a Java String.

In a sense, it's none of your business how Java chooses to represent its
Strings internally, and it won't let you tell it how to do so. If you
want it to convert a series of bytes into a Java String, you just tell
it the encoding of those bytes, then you stand back and let Java "do the
right thing" in converting those bytes into its internal String format.
What do you care what that internal format is, right, as long as it
doesn't lose any data? You tell it how to interpret your series of
bytes, either by default or by explicitly specifying an encoding, then,
from your perspective, the String automatically becomes the correct
series of chars.
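
As a rough sketch of that input step (not the original poster's code): the class and method names below are made up for illustration, and the use of StandardCharsets assumes a JDK recent enough to have it. Older JDKs would pass an encoding name string to the same constructor instead. Note that the one-argument String(byte[]) constructor decodes with the platform default encoding, so it only gives the right chars if the bytes happen to be in that encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the name is an assumption for illustration.
class DecodeDemo {
    // Tell Java the encoding of the bytes and let it build the chars.
    static String decodeUtf8(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is the UTF-8 encoding of U+00E9 (e with acute accent).
        byte[] utf8Bytes = { (byte) 0xC3, (byte) 0xA9 };
        String s = decodeUtf8(utf8Bytes);
        System.out.println(s.length());         // prints 1: one char, not two bytes
        System.out.println((int) s.charAt(0));  // prints 233, i.e. U+00E9
    }
}
```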

When you want to use the String for something internal, you just use the
various String methods. You still don't care what the encoding is. You
just treat it as a series of characters, and don't worry about how those
characters are being represented. Let Java worry about that.

When you want the String output, though, then you care. At that
point, you reverse the process. You tell Java, either explicitly or by
default, the format in which you want those chars output. If it's a
graphical format, you tell Java to "drawString()" and the right glyphs
are painted on the screen. Again, what do you care what the internal
encoding was if you get the right glyph on the screen? If you want your
String output as a series of bytes--into a file, for example--you
tell Java the encoding you want, and out comes a series of bytes (not a
series of "char"s this time, but a series of bytes) in the encoding you
requested. That's all you need.
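
The output direction might be sketched like this, again with made-up class and method names and assuming a JDK that provides StandardCharsets and try-with-resources. It shows two ways of asking for bytes in a named encoding: String.getBytes for an in-memory array, and an OutputStreamWriter wrapped around a stream (a ByteArrayOutputStream stands in for a file here).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the name is an assumption for illustration.
class EncodeDemo {
    // Ask the String directly for its bytes in a named encoding.
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Same result via a Writer, the usual route when writing to a file or socket.
    static byte[] writeUtf8(String s) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            w.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String s = "\u00E9"; // one char: e with acute accent
        System.out.println(encodeUtf8(s).length);                           // prints 2
        System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length); // prints 1
        System.out.println(writeUtf8(s).length);                            // prints 2
    }
}
```

The same one-char String comes out as two bytes or one depending only on the encoding requested at output time.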

I can't imagine when there would be a need to work directly with UTF-8
encoded data internally within your program, because the converters will
do the work for you on input and output. If, however, you do have such a
need, then you can't work with a String. You'll have to work with a byte
array.
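
For what working on the byte array directly could look like, here is a small sketch that counts the characters (code points) in UTF-8 data without ever building a String. The class and method names are hypothetical, and the code assumes the input is well-formed UTF-8; it does no validation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the name is an assumption for illustration.
class Utf8Bytes {
    // Count code points by counting every byte that is NOT a continuation
    // byte. In UTF-8, continuation bytes have the bit pattern 10xxxxxx.
    // Assumes well-formed input.
    static int countCodePoints(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a' is 1 byte, U+00E9 is 2 bytes, U+20AC (euro sign) is 3 bytes.
        byte[] data = "a\u00E9\u20AC".getBytes(StandardCharsets.UTF_8);
        System.out.println(data.length);           // prints 6
        System.out.println(countCodePoints(data)); // prints 3
    }
}
```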

Also, if you want to use an encoding that is not supported by Java's
built-in converters, you'll need to either do everything with byte
arrays, or wait for Java's upcoming converter API.

__Glen Perkins__
glen.perkins@nativeguide.com


