Re: Java, UCS-2, and UTF

From: Mark Davis (markdavis@ispchannel.com)
Date: Wed May 10 2000 - 22:41:08 EDT


Two corrections.

> A Java char is a 16-bit value representing a single UTF-16 code point.

A Java char is a 16-bit value representing a single UTF-16 code UNIT.

(A code unit is also called a code value. It is not a code point, aka scalar value.) See http://www.unicode.org/glossary/
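
To make the unit/point distinction concrete, here is a minimal sketch. It assumes a JDK much newer than the 1.1.x discussed in this thread (codePointCount and codePointAt were added to String later), and the class and method names are made up for illustration.

```java
class CodeUnitDemo {
    // Number of 16-bit UTF-16 code units; this is what String.length() counts.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a well-formed surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "\uD800\uDC00"; // U+10000 written as a surrogate pair
        System.out.println(codeUnits(s));                          // prints 2
        System.out.println(codePoints(s));                         // prints 1
        System.out.println(Integer.toHexString(s.codePointAt(0))); // prints 10000
    }
}
```

Each char in the string is one code unit, so the single code point U+10000 occupies two chars.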

> don't believe Java does this. As it stands, at least in JDK 1.1.8, a UTF-8
> serialization of a String "\uD800\uDC00" is the byte sequence ED A0 80 ED B0
> 80 (the UTF-8 form of D800 and DC00), rather than F0 90 80 80 (the UTF-8
> form of 10000). This makes it more like UCS-2, although I suspect it would
> be inaccurate to say that, as well.
>
This is incorrect.

Try http://www.macchiato.com/mark/UnicodeConverter/index.html in any recent browser. (Even the recent versions of Navigator and Explorer still run Java 1.1.x.)

Pick UTF-32 in one box, UTF8 in another. Type F4 8F BF BF in the UTF8 box. You will see DBFF DFFF in the Unicode box and 00 10 FF FF in the UTF-32 box.
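
The same experiment can be reproduced without the applet. This is only a sketch, not the converter's actual code: it assumes a modern JDK (StandardCharsets did not exist in 1.1.x), and the class and helper names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // Decode UTF-8 bytes into a Java String (UTF-16 code units internally).
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Show each 16-bit code unit of the string as four hex digits.
    static String utf16Units(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // Show each code point as four big-endian bytes, i.e. its UTF-32 form.
    static String utf32Bytes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X %02X %02X %02X",
                    (cp >>> 24) & 0xFF, (cp >>> 16) & 0xFF, (cp >>> 8) & 0xFF, cp & 0xFF));
            i += Character.charCount(cp); // skip both halves of a surrogate pair
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xF4, (byte) 0x8F, (byte) 0xBF, (byte) 0xBF};
        String s = decodeUtf8(utf8);
        System.out.println(utf16Units(s)); // prints DBFF DFFF
        System.out.println(utf32Bytes(s)); // prints 00 10 FF FF
    }
}
```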

The converter's internal code uses the standard encoding parameter on String and the stream classes, as documented at
http://java.sun.com/products/jdk/1.1/docs/api/java.lang.String.html#getBytes(java.lang.String)
or
http://java.sun.com/products/jdk/1.1/docs/api/java.io.OutputStreamWriter.html#OutputStreamWriter(java.io.OutputStream, java.lang.String)
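
A minimal sketch of those two calls follows. The getBytes(String) method and the OutputStreamWriter(OutputStream, String) constructor are the ones linked above; everything else (class name, helpers, try-with-resources, UncheckedIOException) is illustrative and assumes a modern JDK. The output noted in the comments is what a current JDK produces.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class EncodingParameterDemo {
    // Encode through String.getBytes with an explicit encoding name.
    static byte[] viaGetBytes(String s) {
        try {
            return s.getBytes("UTF-8");
        } catch (IOException e) { // UnsupportedEncodingException is an IOException
            throw new UncheckedIOException(e);
        }
    }

    // Encode through an OutputStreamWriter with an explicit encoding name.
    static byte[] viaWriter(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, "UTF-8")) {
            w.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray(); // the writer is closed and flushed by now
    }

    // Format bytes as space-separated two-digit hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\uD800\uDC00"; // U+10000
        System.out.println(hex(viaGetBytes(s))); // prints F0 90 80 80
        System.out.println(hex(viaWriter(s)));   // prints F0 90 80 80
    }
}
```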

Do not use http://java.sun.com/products/jdk/1.1/docs/api/java.io.DataOutputStream.html#writeUTF(java.lang.String)

if you really want UTF-8. It not only generates a variant of UTF-8 but also prepends a length.
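
For comparison, a small sketch of what writeUTF emits for the string from the quoted message. It assumes a modern JDK, and the class and helper names are hypothetical. The length prefix is two bytes, and each surrogate is encoded separately as three bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class WriteUtfDemo {
    // Bytes produced by DataOutputStream.writeUTF: length prefix plus variant UTF-8.
    static byte[] viaWriteUTF(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream data = new DataOutputStream(out)) {
            data.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Bytes produced by the standard UTF-8 encoder, for contrast.
    static byte[] viaStandardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Format bytes as space-separated two-digit hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\uD800\uDC00"; // U+10000
        System.out.println(hex(viaWriteUTF(s)));     // prints 00 06 ED A0 80 ED B0 80
        System.out.println(hex(viaStandardUtf8(s))); // prints F0 90 80 80
    }
}
```

The ED A0 80 ED B0 80 sequence from the quoted message is what writeUTF produces, not what the encoding-parameter APIs produce.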


