Re: Question on Unicode data files

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Feb 26 2001 - 15:28:35 EST


Marco asked:

>
> The Unicode FTP site (ftp://ftp.unicode.org/Public, now temporarily remapped
> on http://www.unicode.org/Public) contains several files with mappings of
> East Asian character sets to/from Unicode.
>
> Are all these sources in sync? If not, which ones is it better to trust?
>
> - UNIDATA/CJKXREF.TXT (containing Big-5, CCCII-1, CNS-1, CNS-2, CNS-E,

Actually: Unihan.txt, as Marco pointed out in his correction.

> EACC=ANSI-Z39-64-89, GB-0=2312-80, GB-1=12345-90, GB-3=7589-87,
> GB-5=7590-87, GB-7=GUCfMC, GB-8=8565-89, JIS-0=X-0208-90, JIS-1=X-0212-90,
> JIS-IBM, KS-C-0=5601-87, KS-C-1=5657-1991, KSC-IBM, Xerox)
>
> - MAPPINGS/EASTASIA/EASTASIA/CJKXREF.TXT (containing same mappings as above)
>
> - MAPPINGS/EASTASIA/EASTASIA/UNIHAN.TXT

The two files in MAPPINGS/EASTASIA/ are old and out of date.
MAPPINGS/EASTASIA/UNIHAN.TXT is identical to Unihan-2.txt, which
can be found under /Public/2.1-Update/. The CJKXREF.TXT is even
older.

The current Unihan file is:

/Public/UNIDATA/Unihan.txt

That is the same as /Public/3.0-Update/Unihan-3.txt.

The Unihan file currently under beta review for Unicode 3.1 can
be found in the beta directory:

/Public/3.1-Update/

It will be renamed to Unihan-3.1.txt when the beta period is done, and
will then also appear in the UNIDATA directory as Unihan.txt.

Everything else under /Public/MAPPINGS/EASTASIA/ consists of
mapping tables to particular code pages, and many of those
are also somewhat out of date.

>
> - MAPPINGS/EASTASIA/EASTASIA/GB/GB12345.TXT
> - MAPPINGS/EASTASIA/EASTASIA/GB/GB2312.TXT

> Moreover, directory UNIDATA contains <UnicodeData.txt> and
> <UnicodeData-Latest.txt>. They seem to always be identical (same date &
> time, same size).
>
> Which one of them is the official Unicode database, and what is the other
> one for?

UnicodeData.txt is the official version. UnicodeData-Latest.txt is a
duplicate kept there because of an earlier naming policy, so that any
links still pointing to "UnicodeData-Latest.txt" for the current
version, instead of "UnicodeData.txt", would not break.
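
Matching dates and sizes only suggest that two files are duplicates.
If you have downloaded both copies, a byte-for-byte comparison settles
it. The following is a minimal sketch, not part of any Unicode tooling;
the helper names (sha256_of, same_contents) are made up for this
illustration, and the paths are assumed to be your own local copies.

```python
import filecmp
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large data files need not fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(first: Path, second: Path) -> bool:
    """Return True when both files hold exactly the same bytes."""
    # shallow=False forces a real content comparison instead of
    # trusting the size and modification time reported by os.stat().
    return filecmp.cmp(first, second, shallow=False)
```

For example, same_contents(Path("UnicodeData.txt"),
Path("UnicodeData-Latest.txt")) should return True for the two local
copies if they really are duplicates. The same check applies to
Unihan.txt and Unihan-3.txt, and sha256_of() is handy when you want to
record a digest and compare against a later download.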

To find out about *official* versions of data files, always start from
the standard page:

http://www.unicode.org/unicode/standard/standard.html

and follow the link to the "Enumerated Versions" page:

http://www.unicode.org/unicode/standard/versions/enumeratedversions.html

That page always gives you the links to the latest data files, and
to the data files for each specific version of the standard. The
MAPPINGS directory is all informative, and is not a part of the
official Unicode Character Database at this time.

--Ken

>
> Thanks.
> _ Marco
>


