RE: Converting between Unicode and EUC

From: Addison Phillips (AddisonP@simultrans.com)
Date: Thu Jun 25 1998 - 18:35:00 EDT


Okay, I should have been nicer and shared the first time... my reply to
Robert is reproduced below.

With regard to Stephane's questions:

>>> different name for EUC(I notice that some UNIX platform are encoding it on
>>> two bytes and some other on tree)

EUC consists of different flavors for different scripts. It is a
"multi-byte" encoding, in that the characters are of different widths.
EUC-JP is the Japanese variant and encodes some characters using one
byte (essentially ASCII), some using two bytes (JIS X 0208: hiragana,
katakana, symbols and the common kanji; half-width katakana also take
two bytes) and some using three (JIS X 0212: supplementary kanji).
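
To make the widths concrete, here is a minimal sketch in Python (my
choice of language; nothing in this thread depends on it). It splits a
run of EUC-JP bytes into characters by looking only at the lead byte.
The function name split_euc_jp is made up for illustration, and the
sketch assumes the usual EUC-JP layout: 0x8E introduces half-width
katakana and 0x8F introduces JIS X 0212.

```python
from typing import List


def split_euc_jp(data: bytes) -> List[bytes]:
    """Split EUC-JP bytes into per-character byte strings by lead byte."""
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            width = 1  # ASCII
        elif lead == 0x8E:
            width = 2  # single shift 2 + half-width katakana
        elif lead == 0x8F:
            width = 3  # single shift 3 + two-byte JIS X 0212 code
        elif 0xA1 <= lead <= 0xFE:
            width = 2  # JIS X 0208
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")
        if i + width > len(data):
            raise ValueError("truncated character at end of input")
        chars.append(data[i:i + width])
        i += width
    return chars


# One character of each kind: ASCII, JIS X 0208, half-width katakana,
# JIS X 0212 (the byte values are hand-built for the example).
sample = b"A" + b"\xa4\xa2" + b"\x8e\xb1" + b"\x8f\xb0\xa1"
widths = [len(c) for c in split_euc_jp(sample)]
print(widths)
```

The sketch does not validate trail bytes; a real converter would also
check that each trail byte falls in the range 0xA1-0xFE.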

EUC-TW (Taiwan) encodes characters as one, two or four bytes and
mirrors the CNS 11643 standard.
EUC-CN (China) encodes characters as one or two bytes and mirrors the
GB 2312 standard.
EUC-KR (Korean) encodes characters as one or two bytes and mirrors
the older KS C 5601 standard.

When I say "mirrors" I mean that the code points are in the same order
with respect to one another.

The number of bytes in a given character does not tell you which EUC
is being used, since every flavor mixes widths. Typically you can find
out which one you're using by typing the command "set" on the command
line and looking at your locale in the environment (the LANG or LC_*
variables), the format of which is xx_YY.charset (where xx=language,
YY=country/region and charset is your character set).
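
As a rough illustration, the following Python sketch pulls a locale
name out of the environment and splits it into its parts. It assumes
the POSIX-style form language_TERRITORY.codeset with an optional
@modifier; parse_locale and current_locale_name are hypothetical
helper names, and "ja_JP.eucJP" is only an example value (the exact
spelling of the charset part varies between platforms).

```python
import os
from typing import NamedTuple, Optional


class LocaleParts(NamedTuple):
    language: str
    territory: Optional[str]
    codeset: Optional[str]


def parse_locale(name: str) -> LocaleParts:
    """Split 'xx_YY.charset' (optionally followed by '@modifier')."""
    name = name.split("@", 1)[0]          # drop any @modifier
    base, _, codeset = name.partition(".")
    language, _, territory = base.partition("_")
    return LocaleParts(language, territory or None, codeset or None)


def current_locale_name() -> Optional[str]:
    """Return the first of LC_ALL, LC_CTYPE, LANG that is set and non-empty."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = os.environ.get(var)
        if value:
            return value
    return None


print(parse_locale("ja_JP.eucJP"))
print(current_locale_name())
```

Names such as "C" or "POSIX" carry no charset part, in which case the
codeset field comes back as None.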

Thanks

Addison

---[begin attachment]---
Hi Robert,

I'm replying out of band with the Unicode list because of the length of
this message.

The basic guide to working with Unicode, if you don't already have it,
is The Unicode Standard, Version 2.0 (ISBN: 0201483459). Amazon.com
ships it in 24 hours...

As for the practical aspects of writing a converter, you should look
around first and see if one has already been written and debugged for
your development environment. Many systems already have a converter that
you can recycle, and there is published code here and there for many of
them that can be easily adapted. Some of us have written this converter
a few times already... ;-)

If you just want a quick-and-dirty program to do a one-off conversion,
you may just want to build/steal a mapping table for iconv.
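
For a one-off job, a sketch like the following may be all that is
needed. It is not iconv: it leans on the "euc_jp" codec that ships
with Python as a stand-in for an existing, debugged converter, and it
assumes the Unicode side should be written as big-endian UTF-16. The
function name euc_jp_to_utf16 is invented for the example.

```python
def euc_jp_to_utf16(data: bytes) -> bytes:
    """Decode EUC-JP bytes and re-encode them as big-endian UTF-16."""
    text = data.decode("euc_jp")       # raises UnicodeDecodeError on bad input
    return text.encode("utf-16-be")


# Build a small EUC-JP sample from a known string, then convert it.
euc = "日本語".encode("euc_jp")
utf16 = euc_jp_to_utf16(euc)
print(euc.hex(), "->", utf16.hex())
```

For whole files, the same two calls can be applied to the contents
read in binary mode.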

Some things you should know if you don't already:

1. EUC comes in "flavors" depending on the script being encoded. The
most common is EUC-JP (for use with Japanese), but there are flavors for
Chinese and Korean (and Western European).

2. You can find the character tables for conversion of the base
character sets that EUC is based on at the Unicode ftp site.
Implementation of an EUC->Unicode/Unicode->EUC converter is relatively
straightforward depending on what you're trying to do, with only a few
things that you need to bear in mind. Since EUC-JP is really defined by
JIS X 0208 and JIS X 0212:1990 for Japanese, you will want those tables.

3. Download and read CJK.INF by Ken Lunde or go out and buy his book
Understanding Japanese Information Processing (aka UJIP), or both. You
can find the file at
ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf-121895 . If
you live in the San Francisco Bay Area you can go look at the draft of
his updated book(s) at Computer Literacy. Note that the Unicode
examples stored on this site are for 10646-1 and thus are outdated.

4. Bear in mind that EUC doesn't encode the full range of Unicode, and
thus your converter should degrade unsupported characters gracefully.
There are also some composition and marking issues that you should bear
in mind, but this e-mail is getting long and I may be way off the mark
for your needs by now...
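
One way to picture the graceful degradation in point 4 is the sketch
below. It again assumes Python's built-in "euc_jp" codec rather than a
hand-written converter, and it uses "?" as the substitute for anything
EUC-JP cannot represent; the right substitute depends on the
application. The function name to_euc_jp is hypothetical.

```python
def to_euc_jp(text: str, errors: str = "replace") -> bytes:
    """Encode text as EUC-JP; by default, unencodable characters become '?'."""
    return text.encode("euc_jp", errors=errors)


# Thai is covered by neither JIS X 0208 nor JIS X 0212, so it degrades.
degraded = to_euc_jp("Aก")
print(degraded)

# With errors="strict" the same input is rejected instead of degraded.
try:
    to_euc_jp("Aก", errors="strict")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.object[exc.start:exc.end])
```

Whether to substitute, drop or reject is a policy decision; the point
is that the converter makes it deliberately rather than emitting
garbage bytes.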

In any case, conversion between 10646-2 and EUC will depend on which
script you are trying to encode and on your development environment. I
have a table for EUC-JP here *somewhere* that I can put on our ftp
site if you would like it. If you really need help/are stuck/need to
test your results, give me a call/drop me a line. Our I18N guys can help
you out with this sort of thing.
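
The table-driven approach can be sketched roughly as follows, in
Python. The table layout here is an assumption made for the example:
one "0xJIS<TAB>0xUnicode<TAB># name" entry per line, which may not
match the column layout of the files on the Unicode ftp site. The
names SAMPLE_TABLE, load_table and euc_to_unicode are made up. Only
ASCII and two-byte JIS X 0208 characters are handled; the 0x8E and
0x8F single-shift sequences are left out to keep it short.

```python
from typing import Dict, Iterable

# Three sample entries in the ASSUMED layout described above.
SAMPLE_TABLE = """\
# JIS X 0208 code, Unicode code point, name
0x2121\t0x3000\t# IDEOGRAPHIC SPACE
0x2422\t0x3042\t# HIRAGANA LETTER A
0x3021\t0x4E9C\t# <CJK>
"""


def load_table(lines: Iterable[str]) -> Dict[bytes, str]:
    """Build an EUC-JP -> Unicode map from 'JIS code, Unicode' lines."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()   # strip comments and blanks
        if not line:
            continue
        jis_hex, uni_hex = line.split()[:2]
        jis = int(jis_hex, 16)
        # EUC is the JIS code with the high bit set on both bytes.
        euc = bytes([(jis >> 8) | 0x80, (jis & 0xFF) | 0x80])
        table[euc] = chr(int(uni_hex, 16))
    return table


def euc_to_unicode(data: bytes, table: Dict[bytes, str]) -> str:
    """Convert ASCII plus two-byte EUC characters; unknown pairs -> U+FFFD."""
    out = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            out.append(chr(data[i]))
            i += 1
        else:
            out.append(table.get(data[i:i + 2], "\uFFFD"))
            i += 2
    return "".join(out)


table = load_table(SAMPLE_TABLE.splitlines())
reverse = {uni: euc for euc, uni in table.items()}   # Unicode -> EUC-JP
print(euc_to_unicode(b"A\xa4\xa2\xb0\xa1", table))
```

Inverting the dictionary, as in the reverse map above, gives the other
direction of the conversion from the same table.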

Thanks,

Addison
        ______________________________________________

        Addison Phillips
        Director, Technology
        SimulTrans, L.L.C.

        650-526-4652 (direct telephone)
        AddisonP@simultrans.com (Internet email)
        http://www.simultrans.com (website)

        "22 languages. One release date."
        ______________________________________________

        -----Original Message-----
        From: Stephane Godin [SMTP:stef@icam.com]
        Sent: Thursday, June 25, 1998 1:57 PM
        To: Unicode List
        Subject: Re: Converting between Unicode and EUC

        I am also interested by a guide to convert between Unicode and
        EUC but I did not know that there was different name for EUC(I
        notice that some UNIX platform are encoding it on two bytes and
        some other on tree)

        For me EUC stan for Extended UNIX Charset or something like
        that. What are the name for the different variation of EUC and
        how can I figure witch one my system is using??

        Sory for that stupid question but I am not a UNIX person...

        Thanks

        Stephane Godin
        ICAM corporation
        stef@icam.com

        ----------
> From: John Cowan <cowan@locke.ccil.org>
> To: Unicode List <unicode@unicode.org>
> Subject: Re: Converting between Unicode and EUC
> Date: Thursday, June 25, 1998 4:11 PM
>
> Robert DiGrazia [ext 252] wrote:
>
> > Is there a guide to converting between Unicode and EUC?
>
> Which EUC?
>
> --
> John Cowan http://www.ccil.org/~cowan cowan@ccil.org
> You tollerday donsk? N. You tolkatiff scowegian? Nn.
> You spigotty anglease? Nnn. You phonio saxo? Nnnn.
> Clear all so! 'Tis a Jute.... (Finnegans Wake 16.5)


