RE: Java, UCS-2, and UTF

From: Mike Brown (mbrown@corp.webb.net)
Date: Wed May 10 2000 - 18:05:51 EDT

Next message: Mark Davis: "Re: Java, UCS-2, and UTF"
Previous message: Pete Resnick: "Word spacing in HTML"
Maybe in reply to: Everett Anderson: "Java, UCS-2, and UTF"
Next in thread: Mark Davis: "Re: Java, UCS-2, and UTF"

> > First of all, does anyone know how Java encodes its chars?
> > I'm under the impression that it's UCS-2.
>
> A Java char is a 16-bit value representing a single UTF-16 code point.

I've been wondering about this myself. If it were truly UTF-16, I would
assume there would be checks for surrogate pairs and illegal sequences. I
don't believe Java does this. As it stands, at least in JDK 1.1.8, the UTF-8
serialization of the String "\uD800\uDC00" is the byte sequence ED A0 80 ED B0
80 (the UTF-8 forms of D800 and DC00, each encoded on its own), rather than
F0 90 80 80 (the UTF-8 form of U+10000). This makes it more like UCS-2,
although I suspect it would be inaccurate to say that, as well.
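
For anyone who wants to check the two byte sequences themselves, here is a
minimal sketch. Some assumptions to flag: the class and method names
(SurrogateUtf8Demo, hex, modifiedUtf8, standardUtf8) are made up for
illustration, and the code targets a current JDK rather than 1.1.8
(StandardCharsets needs Java 7, UncheckedIOException needs Java 8). It is not
necessarily the API behind the observation above. DataOutputStream.writeUTF
is documented to use "modified UTF-8", which encodes each 16-bit char
separately, while String.getBytes with the UTF-8 charset on current JDKs
combines a valid surrogate pair into one four-byte sequence.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SurrogateUtf8Demo {
    // Format bytes as space-separated uppercase hex, e.g. "ED A0 80".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode with DataOutputStream.writeUTF (modified UTF-8) and drop the
    // two-byte length prefix that writeUTF puts in front of the data.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Encode with the standard UTF-8 charset encoder.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "\uD800\uDC00"; // surrogate pair for U+10000

        // Each char is encoded on its own: ED A0 80 ED B0 80
        System.out.println(hex(modifiedUtf8(s)));

        // The pair is combined into one code point: F0 90 80 80
        System.out.println(hex(standardUtf8(s)));
    }
}
```

The first line of output matches the six-byte sequence described above; the
second is what a surrogate-aware UTF-8 encoder produces.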

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:02 EDT