RE: Java, UCS-2, and UTF

From: Mike Brown (mbrown@corp.webb.net)
Date: Wed May 10 2000 - 18:05:51 EDT


> > First of all, does anyone know how Java encodes its chars?
> > I'm under the impression that it's UCS-2.
>
> A Java char is a 16-bit value representing a single UTF-16 code point.

I've been wondering about this myself. If it were truly UTF-16, I would
expect checks for surrogate pairs and illegal sequences. As far as I can
tell, Java performs none. As it stands, at least in JDK 1.1.8, the UTF-8
serialization of the String "\uD800\uDC00" is the byte sequence ED A0 80 ED
B0 80 (the UTF-8 forms of U+D800 and U+DC00, each encoded on its own),
rather than F0 90 80 80 (the UTF-8 form of U+10000). That behavior looks
more like UCS-2, although I suspect calling it UCS-2 would be inaccurate as
well.
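
To make the difference concrete, here is a rough sketch of the two
interpretations. It is hand-rolled for illustration and is not the JDK's own
encoder. The class and method names (SurrogateUtf8Demo, encodePerChar,
encodeUtf16Aware) are made up, and it assumes a JDK newer than 1.1.8, since
it uses StringBuilder and String.format. The first method encodes every
16-bit char independently, which matches the bytes I observed. The second
combines a surrogate pair into one scalar value first, which is what I would
expect from a real UTF-16 implementation.

```java
import java.io.ByteArrayOutputStream;

public class SurrogateUtf8Demo {

    // Append the UTF-8 bytes for one value in the range 0..0x10FFFF.
    static void appendUtf8(ByteArrayOutputStream out, int cp) {
        if (cp < 0x80) {
            out.write(cp);
        } else if (cp < 0x800) {
            out.write(0xC0 | (cp >> 6));
            out.write(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out.write(0xE0 | (cp >> 12));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        } else {
            out.write(0xF0 | (cp >> 18));
            out.write(0x80 | ((cp >> 12) & 0x3F));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        }
    }

    // UCS-2-like view: each 16-bit char is encoded by itself,
    // so surrogates are treated as ordinary values.
    static byte[] encodePerChar(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            appendUtf8(out, s.charAt(i));
        }
        return out.toByteArray();
    }

    // UTF-16-aware view: a high surrogate followed by a low surrogate
    // is combined into a single scalar value before encoding.
    static byte[] encodeUtf16Aware(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length()) {
                char d = s.charAt(i + 1);
                if (d >= 0xDC00 && d <= 0xDFFF) {
                    int cp = 0x10000 + ((c - 0xD800) << 10) + (d - 0xDC00);
                    appendUtf8(out, cp);
                    i++; // skip the low surrogate just consumed
                    continue;
                }
            }
            // Simplification: unpaired surrogates are passed through
            // unchecked; a strict encoder would reject them.
            appendUtf8(out, c);
        }
        return out.toByteArray();
    }

    // Format bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\uD800\uDC00";
        System.out.println(hex(encodePerChar(s)));    // ED A0 80 ED B0 80
        System.out.println(hex(encodeUtf16Aware(s))); // F0 90 80 80
    }
}
```

The only difference between the two methods is whether the pair is
recombined before the bytes are produced, which is exactly the check that
seems to be missing.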


