RE: japanese xml

From: Misha.Wolf@reuters.com
Date: Thu Aug 30 2001 - 05:39:02 EDT


On 30/08/2001 09:16:21 Marco Cimarosti wrote:
> Viranga Ratnaike wrote:
> > Is it ok for Unicode code points to be
> > encoded/serialized using EUC?
> > I'm not planning on doing this; just wondering what (if any)
> > restrictions there are on the choice of transformation format.
>
> EUC size simply doesn't fit Unicode.
>
> Each EUC-encoded character is either a single byte or a sequence of *two*
> bytes. Each byte in a double-byte character is a non-ASCII code (range
> 128..255). So, even assuming that the whole range 128..255 is assigned to
> double byte encoding, EUC allows a maximum of only 16,384 characters (128 x
> 128). But Unicode has 1,114,112 code points...

That is, IMO, quite a misleading reply. It would be more helpful to say
something like:

Yes, it is OK for Unicode code points to be encoded using EUC. Keep in
mind, though, that the EUC character repertoire is much smaller than
the Unicode character repertoire, so many Unicode characters cannot be
encoded directly in EUC. Of course, EUC (EUC-JP in the case of
Japanese) may cover all the characters you require, in which case there
is no problem. Additionally, if you are working with XML (or HTML), you
can encode *all* Unicode characters in an EUC-encoded document by using
numeric character references for the characters outside the EUC
repertoire. Using the same technique, you can encode all Unicode
characters in an ASCII-encoded document.
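To make the numeric character reference technique concrete, here is a
small Python sketch (not part of the original thread; the sample text
and the tag name "note" are just placeholders). Python's codecs can emit
decimal character references for anything outside the target encoding:

    # A string mixing ASCII, characters covered by EUC-JP, and one that
    # is not (the emoji lies outside the EUC-JP repertoire).
    text = "Tokyo 東京 \U0001F600"

    # Serialise as EUC-JP; unencodable characters become numeric
    # character references such as &#128512;, which any XML or HTML
    # parser will resolve back to the original character.
    euc_bytes = text.encode("euc_jp", errors="xmlcharrefreplace")

    # The same technique with ASCII as the document encoding: every
    # non-ASCII character is written as a numeric character reference.
    ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

    # A complete XML document only needs its declaration to match the
    # byte encoding actually used.
    xml_doc = (b'<?xml version="1.0" encoding="EUC-JP"?><note>'
               + euc_bytes + b"</note>")

The resulting bytes are valid EUC-JP (or ASCII), and the parser turns
the references back into the original characters, so no information is
lost.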

Misha
