RE: the HOW-TO of converting Chinese to Unicode HTML

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Thu May 24 2001 - 11:22:06 EDT


Augustus wrote:
> I need to know the HOW-TO of converting Chinese characters into
> Unicode HTML (in Traditional Chinese)?
>
> I am writing a web page that will retrieve Chinese sentences
> from a database, then e-mail them to my clients accordingly. Some of my
> clients' mail software can't interpret Chinese unless I put the Chinese
> in Unicode HTML format.
>
> My converter software can do this conversion, but since I cannot
> see its program source code, I don't know HOW they do it. I want to
> know the logic / algorithm / method of this conversion. Do you know
> where I can learn that?

I think that this should be in the Unicode FAQ. I propose to add it as the
second Q/A in: <http://www.unicode.org/unicode/faq/conversion_mapping.html>.

This is my tentative draft. If it is too long, the parts enclosed in
[[[ ... ]]] can optionally be dropped. Of course, my second-language English
will probably need some corrections.

Q: How can Unicode text be converted to a different encoding (or vice
versa)?

A: Unicode has many more characters than any previously existing encoding.
Because of this, it is almost always possible to convert text from another
encoding to Unicode, without any loss of data.

The opposite process (converting from Unicode to another encoding) is also
possible. However, if the Unicode text to be converted contains many
languages at once, it is possible that no other single encoding exists that
supports all the required languages. In this case, parts of the text data
may be lost in the conversion.
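
As a small illustration of this asymmetry, here is a Python sketch. The
sample string and the choice of ISO-8859-5 as the target encoding are
assumptions made only for the example; the codec names are the ones shipped
with Python's standard library.

```python
# Sample text mixing Latin, Cyrillic and Chinese (chosen for illustration).
text = "abc \u0414\u0430 \u4e2d\u6587"

# A Unicode encoding form such as UTF-8 round-trips the text without loss.
assert text.encode("utf-8").decode("utf-8") == text

# ISO-8859-5 covers Latin and Cyrillic but has no ideographs.
# errors="replace" substitutes "?" for every character it cannot map,
# so the two ideographs are lost.
lossy = text.encode("iso8859_5", errors="replace")
print(lossy.decode("iso8859_5"))
```

With errors="strict" (the default) the same call raises UnicodeEncodeError
instead of silently dropping data, which is often the safer choice.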

Generally speaking, there is no algorithmic correspondence between Unicode
code points and code points in other encodings. For this reason, the
conversion must make use of *mapping tables*.
[[[ This is especially true for CJK ideographs, because the set of Unicode
ideographs has been obtained by merging and re-sorting lists of ideographs
taken from a variety of sources, such as CJK encoding standards or
dictionaries. ]]]

[[[ In the case of encodings for alphabetic writing systems, however, most
encodings (including Unicode) normally store letters in their natural
alphabetical order. This ordering can be exploited to compress mapping
tables. For instance, the 256-line table for ISO-8859-5
(ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-5.TXT) can be turned
into an 8-line table by using code ranges:
        00..A0 = 0000..00A0
        A1..AC = 0401..040C
        AD = 00AD
        AE..EF = 040E..044F
        F0 = 2116
        F1..FC = 0451..045C
        FD = 00A7
        FE..FF = 045E..045F ]]]
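
A range table of this kind could be used roughly as in the Python sketch
below. The names RANGES, decode_byte and decode_iso8859_5 are made up for
this example. The entry for byte FD uses U+00A7 (SECTION SIGN), which is the
value listed in 8859-5.TXT.

```python
# Each entry: (first byte, last byte, Unicode code point of the first byte).
RANGES = [
    (0x00, 0xA0, 0x0000),
    (0xA1, 0xAC, 0x0401),
    (0xAD, 0xAD, 0x00AD),
    (0xAE, 0xEF, 0x040E),
    (0xF0, 0xF0, 0x2116),
    (0xF1, 0xFC, 0x0451),
    (0xFD, 0xFD, 0x00A7),
    (0xFE, 0xFF, 0x045E),
]


def decode_byte(b):
    """Map one ISO-8859-5 byte value to a Unicode character."""
    for first, last, base in RANGES:
        if first <= b <= last:
            # Within a range, the offset from the first byte is preserved.
            return chr(base + (b - first))
    raise ValueError("byte out of range: %r" % (b,))


def decode_iso8859_5(data):
    """Convert ISO-8859-5 bytes to a Unicode string using the range table."""
    return "".join(decode_byte(b) for b in data)


print(decode_iso8859_5(b"\xbc\xde\xe1\xda\xd2\xd0"))
```

A linear scan is enough for eight ranges; a longer table would probably
want a binary search over the range starts.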

The directory ftp://ftp.unicode.org/Public/MAPPINGS/ on the Unicode
Consortium's FTP site contains many mapping tables between Unicode and other
encodings.
[[[ Other similar tables may be provided by other people or organizations.
Links to such external resources are found on the Unicode Consortium's web
site (http://www.unicode.org/unicode/onlinedat/resources.html). ]]]
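
The single-byte tables in that directory are plain text with one mapping per
line: the code in the other encoding as a hex number, the Unicode code point
as a hex number, then a "#" comment with the character name. A minimal
Python sketch of a reader for that layout follows. The function name
parse_mapping_table and the three sample lines are assumptions for the
example; a real program would read the downloaded file instead.

```python
def parse_mapping_table(lines):
    """Return {legacy code: Unicode code point} from mapping-table lines."""
    table = {}
    for line in lines:
        # Drop the trailing comment, then split the remaining columns.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            # Comment-only, blank, or undefined entry: nothing to map.
            continue
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


# A few lines in the style of 8859-5.TXT (sample data for illustration).
SAMPLE = [
    "# ISO/IEC 8859-5 to Unicode (excerpt)",
    "0xA1\t0x0401\t#\tCYRILLIC CAPITAL LETTER IO",
    "0xF0\t0x2116\t#\tNUMERO SIGN",
    "0xFD\t0x00A7\t#\tSECTION SIGN",
]

table = parse_mapping_table(SAMPLE)
print({hex(k): hex(v) for k, v in table.items()})
```

Inverting the dictionary gives the table for the opposite direction, from
Unicode to the other encoding.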

The most reliable source for the mapping of CJK ideographs is the Unihan
database (ftp://ftp.unicode.org/Public/UNIDATA/Unihan.txt).
[[[ The relevant fields for encoding mapping are kBigFive, kCNS1986, kGB0,
kGB1, kGB3, kGB5, kGB7, kGB8, kJis0, kJis1, kKSC0, kKSC1, kCCCII, kCNS1992,
kIBMJapan, and kXerox. ]]]
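
Unihan.txt has one property per line, in three tab-separated columns: the
code point as "U+XXXX", the field name, and the value. The Python sketch
below pulls out one mapping field. The function name parse_unihan and the
sample lines are assumptions for the example, and fields whose value holds
several codes would need extra handling.

```python
def parse_unihan(lines, field):
    """Return {legacy code string: Unicode character} for one Unihan field."""
    mapping = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        code_point, tag, value = line.rstrip("\n").split("\t", 2)
        if tag == field:
            # code_point looks like "U+4E00"; strip the "U+" prefix.
            mapping[value] = chr(int(code_point[2:], 16))
    return mapping


# Sample lines in the Unihan layout (for illustration only).
SAMPLE = [
    "# Unihan excerpt",
    "U+4E00\tkBigFive\tA440",
    "U+4E00\tkGB0\t5027",
    "U+4E00\tkTotalStrokes\t1",
]

big5_to_unicode = parse_unihan(SAMPLE, "kBigFive")
print(big5_to_unicode)
```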

All these mapping tables are normally in the form of human-readable text
files. In order for these data to be used in a software application, the
textual information must be incorporated into the program's code or
converted into a program-readable form.
[[[ Many operating systems already include a binary version of these data,
as well as utility programs or API functions to use them. ]]]
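
For Augustus's original problem, here is a sketch in Python, whose standard
library ships such built-in tables. I am assuming that the database returns
Big5 bytes and that "Unicode HTML" means numeric character references
(&#NNNNN;); both are guesses about his setup, and the sample bytes are only
an illustration.

```python
# Big5 bytes as they might come from the database (sample data).
big5_bytes = b"\xa4\xa4\xa4\xe5"

# Step 1: legacy encoding -> Unicode, using the codec's built-in mapping table.
text = big5_bytes.decode("big5")

# Step 2: Unicode -> ASCII-only HTML. Every character outside ASCII is
# written as a decimal numeric character reference.
html = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(html)  # prints "&#20013;&#25991;"
```

The result contains only ASCII characters, so it can pass through mail
software that does not understand Big5, provided the client renders HTML.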

_ Marco


