Re: Help with Unicode decompiling problem

From: Ben Yenko-Martinka (ben.martinka@livingimages.com)
Date: Thu Nov 04 1999 - 11:57:23 EST


Thank you, Sean, but from what I've read, while the environment is free to use
any of the encoding schemes you mention, a Java char is always a 16-bit
Unicode value, and the various stream filter classes were provided to bridge
the gap and make this translation transparent. If I were reading bytes
directly from a disk file using low-level operations then yes, I would have
problems, but with BreakIterator.getCharacterInstance() and streams I should be
all right.
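
To illustrate the translation step described above, here is a minimal sketch
of decoding raw bytes into 16-bit Java chars through a reader. The class name
DecodeDemo and the choice of UTF-8 are assumptions made for the example, and
StandardCharsets is a later addition to the JDK; the point is only that the
reader, not the calling code, handles the byte-to-char conversion.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decodes external bytes into Java's 16-bit chars.
class DecodeDemo {
    // Reads all bytes through an InputStreamReader with an explicit charset
    // (UTF-8 is assumed here) and returns the decoded chars as a String.
    static String decode(byte[] bytes) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c); // each read() yields one 16-bit char
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // UTF-8 bytes for 'e' followed by U+0301 COMBINING ACUTE ACCENT.
        byte[] bytes = {0x65, (byte) 0xCC, (byte) 0x81};
        String text = decode(bytes);
        System.out.println(text.length()); // three bytes become two chars
    }
}
```

Naming the charset explicitly avoids depending on the platform default,
which is where surprises usually come from.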

What I'm still unsure about is what happens when I encounter "user" characters
represented by a base Unicode character plus a combining Unicode character (not
a real issue with Latin encodings, I realize). Though BreakIterator will
correctly point me to the right user character boundary, how do I determine its
width, that is, whether it is one or two Unicode characters? And what is the
storage order: does it do look-ahead, or is the combining character stored and
retrieved first so it knows not to process the base character on its own?
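
One way to probe the width question is to take the difference between
consecutive boundaries that the iterator reports. The sketch below assumes
that approach; the class name UserCharacters and the helper split() are
hypothetical. In Unicode text a combining mark is stored after its base
character, so the sample string places U+0301 after 'e'.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class: splits text into user-perceived characters.
class UserCharacters {
    // Returns each user character as its own String; the String's length()
    // is the width of that user character in 16-bit Java chars.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            parts.add(text.substring(start, end)); // width is end - start
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        // 'e' + COMBINING ACUTE ACCENT (base first, mark second), then 'a'.
        String text = "e\u0301a";
        for (String part : split(text)) {
            System.out.println(part.length() + " char(s)");
        }
    }
}
```

Because the base character comes first, the iterator has to look past it to
see whether combining marks follow before it can report the boundary.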

--Ben

"OLeary, Sean (NJ)" wrote:

> Hello Ben,
>
> My understanding of Java characters is that the language specifies that the
> characters are Unicode but it does not specify what form of Unicode. The
> implementations are free to use UTF-8, UTF-16 or even another non-standard
> internal encoding. This may affect the output from the sample code.
>
> Sean O'Leary
> oleary@awii.com
> Automated Wagering International
> 973-594-5077


