Announcing CJK 2.5 (Chin/Jap/Kor for LaTeX2e)

From: Werner Lemberg (A7621GAC@helios.edvz.univie.ac.at)
Date: Thu Jan 19 1995 - 15:21:22 EST


This is the LaTeX2e style package CJK Version 2.4 (3-Jan-1995)
===============================================================

It contains the following files:

    history.txt  Package history
    CJK.txt      This file
    CJK.sty      A LaTeX2e style file to enable CJK (Chinese/Japanese/Korean)
                 logographs (i.e. Hanzi/Kanji/Hangul) with LaTeX2e
    CJK.enc      Master Encoding File
    standard.enc
    Bg5.enc
    Bg5pp.enc
    KS.enc
    utf8.enc
    pmCsmall.enc
    pmCsmpp.enc
    pmCbig.enc   Encoding scheme files
    Bg5.chr
    standard.chr
    hangul.chr
    utf8.chr
    pmC.chr      Character encoding files
    Bg5conv.tex  preprocessor for Big 5 encoded text files
    bg5latex.bat a batch file (for DOS) to demonstrate use of Bg5conv.tex
    CNS.sty
    CNS.enc
    CNS.chr      CNS encoding to be used together with a different CJK
                 encoding following Christian Wittern's CEF (Chinese Encoding
                 Framework)
    UBg5.fd
    UGBs.fd
    UGBt.fd      Font definition files for Chinese (examples only!)
    UJIS.fd      Font definition file for Japanese (example only!)
    Uhangul.fd   Font definition file for standard Hangul fonts
    Uhanja.fd    Font definition file for Hanja font (example only!)
    Uutf8.fd     Font definition file for Unicode font (example only!)
    UpmC-Bg5.fd
    UpmC-GBs.fd
    UpmC-GBt.fd
    UpmC-JIS.fd
    UpmC-KS.fd   Font definition files for (old) pmC-fonts
    UCNS-1.fd
    UCNS-2.fd
    UCNS-3.fd
    UCNS-4.fd
    UCNS-5.fd
    UCNS-6.fd
    UCNS-7.fd    Font definition files for CNS fonts (examples only!)
    vf/*.vf
    tfm/*.tfm    virtual fonts and metric files for hangul standard fonts to
                 use in combination with the font libraries lj_han and lj_han1
                 (available at the CTAN hosts)

    utils/hbf2gf.w          CWEB source file for hbf2gf
    utils/hbf2gf.c          C code file extracted from the CWEB source files
    utils/hbf2gf.dvi        Documentation extracted from the CWEB source files
    utils/hbf2gf.cfg        Configuration file example
    utils/hbf2gf.exe        Bound executable for DOS and OS/2
    utils/hbf.h
    utils/hbf.c             Ross Paterson's HBF API (with small extensions)
    utils/Makefile          Makefile for hbf2gf
    utils/emx.exe
    utils/emx.dll
    utils/rsx.exe           Runtime binaries for DOS and OS/2 (must be in
                            the path)

This is freely distributable under the GNU General Public License.

Use
---

Use CJK.sty as a package, e.g.

\documentclass{article}
\usepackage{CJK} .

Two new environments \begin{CJK}{encoding}{shape} ... \end{CJK} and \begin{CJK*}{encoding}{shape} ... \end{CJK*} are defined:

encoding  The following encodings are currently implemented in CJK.enc (for CNS encoding see below):

    Bg5   (Big 5)
    GBs   (GuoBiao with simplified characters, G1 = GB 2312-80)
    GBt   (GuoBiao with traditional characters, G1 = GB 12345-90)
    JIS   (Japanese Industrial Standard, G1 = JIS X0208-1990)
    KS    (hangul and hanja, G1 = KSC 5601-1987)
    utf8  (UTF 8 (Unicode Transformation Format 8), also called UTF 2 or FSS-UTF)

The encodings (except Big 5 and UTF 8) are simplified EUC (Extended UNIX Code) character sets without single shifts. The character set slot G1 stands for two byte encodings with byte values taken from the GR (Graphic Right) character range 0xA1-0xFE (as defined in ISO 2022).

For compatibility with the pmC package these additional encodings are defined: pmC-Bg5, pmC-GBs, pmC-GBt, pmC-JIS, and pmC-KS. Using these encodings is discouraged because they waste fonts. If possible, convert your original CJK bitmaps with hbf2gf (see below) to CJK encodings.

shape  It is impossible to know what fonts are available at your site; look at the example .fd files to see how to create .fd files suiting your needs. If you use the KS environment, this parameter is unused (see below).

The CJK* environment will swallow unprotected spaces and newlines after a CJK character, whereas CJK will not.

This is a very realistic example:

\begin{CJK*}{GBs}{kai}
... Text in GuoBiao encoding ...
\end{CJK*}

How it works
------------

Asiatic logographs can't be represented with one byte per character. At least two bytes are needed, and the most common encoding schemes (GB, Big 5, JIS, KS etc.) reserve a certain range for the first byte (usually 0xA1-0xFE or a part of it) which signals that this byte and the next one together represent an Asiatic logograph. This means that plain ASCII text (i.e. characters between 0x00 and 0x7F) will be left undisturbed, and most characters of the extended ASCII character set (0x80-0xFF) will be assigned to a CJK encoding.
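The first-byte rule described above can be sketched in a few lines of Python. This is only an illustration of the byte ranges, not the TeX macro mechanism used by CJK.sty; the function name split_cjk_bytes is hypothetical, and the sketch assumes the common first-byte range 0xA1-0xFE.

```python
def split_cjk_bytes(data: bytes):
    """Split a byte string into single bytes and two-byte CJK characters.

    Assumption: a byte in 0xA1-0xFE starts a two-byte character;
    every other byte stands for itself.
    """
    tokens = []
    i = 0
    while i < len(data):
        first = data[i]
        if 0xA1 <= first <= 0xFE and i + 1 < len(data):
            # First byte signals that this and the next byte belong together.
            tokens.append(("cjk", data[i:i + 2]))
            i += 2
        else:
            # Plain ASCII (0x00-0x7F) and anything else is left undisturbed.
            tokens.append(("other", data[i:i + 1]))
            i += 1
    return tokens
```

For instance, the four bytes 0x41 0xB0 0xA1 0x42 are split into the ASCII byte `A', one two-byte character, and the ASCII byte `B'.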

Due to the internal architecture of TeX it is impossible to support ISO 2022 escape sequences as used with MULE (Multi Language Emacs). MULE is a common extension of GNU-Emacs to support many non-English scripts, including Chinese, Japanese, Korean, Hebrew etc.

CJK.sty makes the characters 0xA1-0xFE active inside of an environment and assigns the macros \CJK@char and \CJK@charx to the active characters; these macros select the proper font. The real mechanism is a bit more complicated to assure robustness (it was borrowed and modified from german.sty) and correct handling of punctuation characters.

The encodings
-------------

CJK.sty defines internally \CJK@standardEncoding, \CJK@Bg5Encoding, \CJK@KSEncoding, \CJK@utf8Encoding, and for compatibility with pmC, \CJK@pmCsmallEncoding and \CJK@pmCbigEncoding.

\CJK@standardEncoding will be used for encodings with the second byte in the range 0xA1-0xFE (GB, JIS).

\CJK@Bg5Encoding will be used for Big 5 encoding (e.g. NTU fonts) with the second byte in the range 0x40-0xFE.

\CJK@KSEncoding will be used for KS encoding. Two sets of subfonts are defined, one for Hangul syllables and elements, and a second for Hanja. For more details see below.

\CJK@utf8Encoding will be used for Unicode in UTF 8. The first byte is in the range 0xC0-0xDF for two byte values and in the range 0xE0-0xEF for three byte values. The other bytes are in the range 0x80-0xBF. Note that CJK expects two hexadecimal digits as a running number in the font name instead of two decimal digits. Use the option `unicode on' if you use hbf2gf to transform bitmap fonts in HBF format to .pk fonts as used by CJK.sty .
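The UTF 8 byte ranges listed above can be checked with a small Python sketch. The function names are hypothetical and the code only mirrors the ranges given in this text (two and three byte values); it says nothing about how CJK.sty implements the decoding in TeX.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the number of bytes in a UTF 8 sequence, judged by its first byte."""
    if first_byte <= 0x7F:
        return 1            # plain ASCII
    if 0xC0 <= first_byte <= 0xDF:
        return 2            # two byte value
    if 0xE0 <= first_byte <= 0xEF:
        return 3            # three byte value
    raise ValueError("not a valid first byte for the ranges described here")


def decode_utf8_char(seq: bytes) -> int:
    """Decode one UTF 8 sequence of one to three bytes into a Unicode value."""
    length = utf8_sequence_length(seq[0])
    if len(seq) != length:
        raise ValueError("wrong number of bytes")
    if length == 1:
        return seq[0]
    # Keep the payload bits of the first byte (5 bits or 4 bits).
    code = seq[0] & (0x1F if length == 2 else 0x0F)
    for byte in seq[1:]:
        if not 0x80 <= byte <= 0xBF:
            raise ValueError("continuation byte out of range 0x80-0xBF")
        code = (code << 6) | (byte & 0x3F)
    return code
```

The high byte of the decoded value is the kind of number that can be written as two hexadecimal digits; whether CJK uses exactly this value as the running number in the font name is not stated here and should be checked against the example .fd files.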

\CJK@pmCsmallEncoding and \CJK@pmCbigEncoding can be activated with \pmCsmall (this is the default) and \pmCbig inside the CJK environment. Note that the original pmC fonts have two character sizes per font (the bigger ones with an offset of -128); pmC-Big 5 encoded fonts cannot contain big characters. The names of the fonts in the UpmC-xxx.fd files reflect the modifications added by Marc Leisher <mleisher@nmsu.edu> to the original poor man's Chinese (pmC) package written by Thomas Ridgeway <ridgeway@blackbox.hacc.washington.edu>.

The fonts
---------

CJK.sty uses NFSS (New Font Selection Scheme, now part of LaTeX2e) which has some advantages over the font selection offered with pmC (for plain TeX and LaTeX 2.09):

o TeX fonts are loaded only on demand. This is especially useful with Asiatic logographs. If you have e.g. three Chinese characters in your text, pmC must load the whole Chinese font (about 85 TeX fonts), whereas LaTeX2e loads only three fonts normally.

o As long as the limit of 256 TeX fonts will not be exceeded, you can use as many CJK fonts as you like (e.g. simplified and traditional Chinese characters together with Japanese fonts in different sizes) --- pmC is limited to two sizes and can only have two CJK fonts at the same time.

In the web2c-TeX package (for UNIX) you will find a patch which allows the use of more than 256 TeX fonts.

o You need not care about the right size of CJK fonts in footnotes etc. They will obey the NFSS (although changing attributes other than font series and size must be done with \CJKenc and \CJKshape).

For Hangul font selection see below.

Of course you must have access to bitmap CJK fonts --- use hbf2gf to convert them to .pk-fonts. See the last section for availability of precompiled fonts.

If you chose one font per active character as with the pmC macros, you would waste character space (256 characters per font are possible with TeX 3). Therefore CJK.sty expects the whole Asiatic font to be split into TeX subfonts with 256 characters each.

An example:

GuoBiao-encoded simplified characters in song style at 12pt:
^               ^                        ^^            ^^

first byte   second byte   TeX font   offset
----------------------------------------------
   0xA1       0xA1-0xFE    gsso1201       0
   0xA2       0xA1-0xFE    gsso1201      94
   0xA3       0xA1-0xE4    gsso1201     188
   0xA3       0xE5-0xFE    gsso1202       0
   0xA4       0xA1-0xFE    gsso1202      26
   0xA5       0xA1-0xFE    gsso1202     120
     .
     .
     .
   0xFE       0xA1-0xFE    gsso1235      38
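The table above follows from simple arithmetic: each first byte contributes 94 characters, and the characters are packed consecutively into subfonts of 256 characters. A minimal Python sketch, with hypothetical function names and the font name prefix gsso12 taken from the example:

```python
def gb_subfont(first: int, second: int, chars_per_font: int = 256):
    """Map a two-byte code (both bytes in 0xA1-0xFE) to (subfont number, offset)."""
    if not (0xA1 <= first <= 0xFE and 0xA1 <= second <= 0xFE):
        raise ValueError("both bytes must be in the range 0xA1-0xFE")
    # Linear index: 94 characters per first-byte value.
    index = (first - 0xA1) * 94 + (second - 0xA1)
    # Subfonts are numbered from 1; the offset is the slot inside the subfont.
    return index // chars_per_font + 1, index % chars_per_font


def gb_font_name(first: int, second: int, prefix: str = "gsso12"):
    """Return (TeX font name, offset), using two decimal digits as running number."""
    number, offset = gb_subfont(first, second)
    return f"{prefix}{number:02d}", offset
```

With these definitions, gb_font_name(0xA3, 0xE5) gives the first slot of gsso1202, and gb_font_name(0xFE, 0xA1) gives offset 38 in gsso1235, matching the table.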

For converting to .pk-files with hbf2gf, you must get the appropriate HBF (Hanzi Bitmap Font) header files from ifcss.org (or create one if you can't find the right one); almost all Chinese bitmap fonts in the public domain together with their HBF headers are collected there. These HBF files document CJK fonts completely.

Using hbf2gf
-------------

hbf2gf converts CJK bitmaps with an HBF header file into .gf-files (and consequently into .pk fonts).

Syntax:

hbf2gf configuration_file

Keywords in the configuration file must start a line, the appropriate values being on the same line separated with one or more blanks or tabs.

Here is an example configuration file jfs56.cfg (please refer to hbf2gf.dvi for a description of the keywords):

hbf_header     jfs56.hbf
mag_x          1.482
x_offset       3
y_offset       -8
comment        jianti fansongti 56x56 pixel font magnified and adapted for 10pt

nmb_files      -1

output_name    gsfs10

checksum       123456789

dpi_x          600
dpi_y          600

coding         codingscheme GB 2312-80 encoded TeX text

pk_directory   d:\china\pixel.ljh\600dpi\
tfm_directory  d:\china\tfm\

rm_command     del
cp_command     copy
long_extension off
job_extension  .cmd
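The line format of the configuration file (a keyword at the start of a line, followed by its value after blanks or tabs) can be illustrated with a short Python sketch. This is not how hbf2gf itself reads the file; the function name is hypothetical, and the sketch assumes that lines not starting with a keyword are ignored and that any word at the start of a line counts as a keyword, whereas hbf2gf only knows the keywords documented in hbf2gf.dvi.

```python
def parse_hbf2gf_config(text: str) -> dict:
    """Collect keyword/value pairs from the text of a configuration file."""
    settings = {}
    for line in text.splitlines():
        # Keywords must start a line: skip empty and indented lines.
        if not line or line[0] in " \t":
            continue
        parts = line.split(None, 1)   # split at the first run of blanks/tabs
        keyword = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[keyword] = value
    return settings
```

Applied to the example file above, the entry for output_name would be gsfs10, and the entry for comment would be the whole rest of that line.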

And here are the results:

input files: jfs56.a - jfs56.e, jfs56.hbf

program call: hbf2gf jfs56.cfg

intermediate files: gsfs10.cmd, gsfs1001.gf - gsfs1032.gf, gsfs10.pl

batch file call: gsfs10.cmd

output files: d:\china\pixel.ljh\600dpi\gsfs1001.pk - gsfs1032.pk, d:\china\tfm\gsfs1001.tfm - gsfs1032.tfm

[gsfs: GuoBiao simple encoded FanSong style
       ^       ^              ^  ^
 It's hard to overcome the DOS restriction of 8 characters in a file name if you need two characters as a running number...]

This would be a correct entry in UGBs.fd:

\DeclareFontShape{U}{GBs}{m}{fansong}{
    <-10>    CJKfixed  * gsfs10
    <10>     sCJKfixed * gsfs10
    <10.95>  sCJKfixed * gsfs12
    <12>     sCJKfixed * gsfs12
    <14.4>   sCJKfixed * gsfs14
    <17.28>  sCJKfixed * gsfs17
    <20.74-> CJKfixed  * gsfs17}{}

assuming that you have created fonts for 10, 12, 14.4, and 17.28pt.

Korean input
------------

(The status of this feature is experimental. I can't speak Korean and would be glad to hear comments from people who have any idea what is happening here :-)

There are already different packages handling Hangul: hlatex, htex etc.; there is one package which also can handle hanja: jhtex.

The main difference between the packages just mentioned and CJK is their use of a preprocessor which converts text files containing KS encoded text into a TeX file. This has some advantages, but the output is completely unreadable. Additionally, the output lines become rather long (a two byte character code is converted into a string of up to 11 characters), which may confuse some editors. And if your text also contains Chinese or Japanese, you can't use KS to TeX converters at all, because the code ranges overlap and the converters cannot recognize which text is Korean and which is not.

In contrast, CJK does not need a preprocessor and the problems mentioned above are nonexistent, but you get nothing for free: CJK uses the virtual font mechanism to map the hangul syllables onto Hangul Elements (11 virtual fonts map to 2 real fonts), whereas preprocessors directly use the real fonts.

If you want a complete Korean environment, I recommend jhtex. There you will also find a hangul.sty which modifies (among others) the sectioning commands to enable Korean chapter counting and Korean headers.

To use KS encoding, say

\begin{CJK}{KS}{} ... \end{CJK} .

These font switches are available inside the environment:

fonts from hLaTeX:

* \mj   MyoungJo (default)
  \gt   Gothic
  \gs   BootGulssi
  \gr   Graphic
  \dr   Dinaru

fonts from jhTeX:

* \hgt  Hangul Gothic
* \hmj  Hangul MyoungJo (MunHwaBu fonts)
* \hpg  Hangul Pilgi
  \hol  Hangul Outline (MyoungJo)

If a font is marked with a star, bold series are available.

You will find the hangul fonts in the lj_han and lj_han1 packages. These are emTeX libraries for 300 dpi resolution which can be easily converted back to .pk fonts using the fontlib package of emTeX. If you need different resolutions, you must obtain the original metafont sources of the hlatex_mf.tar.gz and the jhtex packages. Note that the shapes of Hangul elements are not satisfactory.

You will find the needed virtual fonts and virtual metric files in the vf and tfm directories. Move the .tfm files into a directory TeX will scan. You need a dvi driver which understands virtual fonts -- move the .vf files into a directory your dvi driver will scan.

For non-hangul characters inside the KS environment (i.e. the first byte is in the range 0xA1-0xAF, except 0xA4, or in the range 0xC9-0xFD), fonts are taken from Uhanja.fd . This makes it possible to use many hangul fonts together with perhaps only one or two different hanja fonts. If you don't want the overlay of hangul fonts from Uhangul.fd, say \CJKhanja. The opposite command is \CJKhangul.

Archaic hangul elements (KS 0xA4D5-0xA4FE) and the character KS 0xA4D4 are only accessible if \CJKhanja is active.
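The selection between Uhangul.fd and Uhanja.fd described in the last two paragraphs can be summarized as a decision function. The following Python sketch is an interpretation, not the macro code: the function name is hypothetical, and it assumes that first bytes 0xA4 and 0xB0-0xC8 (the complement of the stated non-hangul ranges) are the hangul ones and that \CJKhanja sends every character to Uhanja.fd.

```python
def ks_font_definition(first: int, second: int, hanja_mode: bool = False) -> str:
    """Return the .fd file a KS character would be taken from.

    hanja_mode=True models \\CJKhanja, hanja_mode=False models \\CJKhangul.
    """
    if not (0xA1 <= first <= 0xFD and 0xA1 <= second <= 0xFE):
        raise ValueError("byte outside the ranges described in the text")
    if hanja_mode:
        # Assumption: without the hangul overlay everything comes from Uhanja.fd.
        return "Uhanja.fd"
    if first == 0xA4:
        if second >= 0xD4:
            # KS 0xA4D4 and the archaic elements 0xA4D5-0xA4FE.
            raise LookupError("only accessible if \\CJKhanja is active")
        return "Uhangul.fd"
    if 0xB0 <= first <= 0xC8:
        return "Uhangul.fd"
    # Remaining first bytes: 0xA1-0xAF (except 0xA4) and 0xC9-0xFD.
    return "Uhanja.fd"
```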

You should convert your KS hanja fonts using hbf2gf as described above.

Bg5conv.tex
-----------

Using the Bg5text environment is a mess. Having an external preprocessor needs access to a compiler, which is not always the case. Thus I wrote Bg5conv.tex, a preprocessor for Big 5 characters to overcome the restrictions of the Bg5text environment.

Each Big 5 character `XY' will be converted into the form `XZZZ.'; ZZZ is a decimal number followed by a dot. The use of Bg5conv.tex is completely transparent, no changes to your document are necessary.
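To make the `XZZZ.' form concrete, here is a rough Python sketch of such a conversion. It is not a port of Bg5conv.tex: the function name is hypothetical, and it assumes that ZZZ is the decimal value of the second byte Y, written without zero padding, and that first bytes lie in 0xA1-0xFE. The \CJKpreproc line inserted by Bg5conv.tex is left out.

```python
def bg5_preprocess(data: bytes) -> bytes:
    """Rewrite each two-byte Big 5 character XY as X, decimal(Y), and a dot."""
    out = bytearray()
    i = 0
    while i < len(data):
        first = data[i]
        if 0xA1 <= first <= 0xFE and i + 1 < len(data):
            out.append(first)                           # X is kept as it is
            out += str(data[i + 1]).encode("ascii")     # ZZZ: second byte in decimal
            out += b"."                                 # terminating dot
            i += 2
        else:
            out.append(first)                           # everything else is copied
            i += 1
    return bytes(out)
```

After this step a second byte such as 0x5C (the backslash) no longer appears in the file as a raw character, which is why the converted file can be processed like a normal input file.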

The use is simple: before calling Bg5conv.tex you must define \CJKin (and optionally \CJKout); after conversion the output file will be processed like a normal input file. Bg5conv.tex additionally inserts the (empty) macro \CJKpreproc as the first line of the output.

Here is an example batch file (bg5latex.bat) for DOS which demonstrates the use of Bg5conv.tex . The default names for \CJKin and \CJKout are `Bg5input.tex' and `Bg5input.cjk' respectively. Note that you must not use an extension for the input file here (I am too lazy to write a sophisticated shell program; any volunteers are welcome):

call latex \def\CJKin{%1} \def\CJKout{%1.cjk} \input Bg5conv.tex
call latex %1.cjk

You say

bg5latex mytext

to get mytext.tex processed.

It's not possible to mix Big 5 encoding with different encodings (except CNS) if Bg5conv.tex is used (and I doubt whether this should be ever necessary).

CNS.sty
-------

(The status of this feature is experimental.)

Christian Wittern <g53150@sakura.kudpc.kyoto-u.ac.jp> develops CEF, the Chinese Encoding Framework. This will enable the use of Big 5 as the primary encoding with CNS 11643-1992 as a secondary character set for characters not included in Big 5. Inputting CNS characters into a text will be done with a data base. To facilitate this, the first bytes of the three byte CNS encoding are mapped onto the characters 0x81-0x87.

Say

\usepackage[options]{CNS}

to use CNS (CJK.sty will be loaded automatically). If you need to specify options for CJK, say

\usepackage[options]{CJK}
\usepackage[options]{CNS}

The possible options for CNS.sty are `compressed' and `uncompressed' to indicate the use of compressed (256 characters per font a la CJK.sty) or uncompressed fonts (94 characters per plane as in pmC). Default is compressed.

CNS encoding is available only in CJK environments; the commands \CNSchar (of course with three parameters for byte 1 to 3) and \CNSshape are similar to their CJK counterparts. Default value of \CNSshape is `song'.

Uncompressed fonts should be named like pmC fonts (font names ending with hex numbers).

The .fd-files
-------------

CJK fonts can be installed as easily as normal TeX fonts!

CJK.sty defines four new size functions:

CJK        corresponds to `' (empty)
sCJK       corresponds to `s'
CJKfixed   corresponds to `fixed'
sCJKfixed  corresponds to `sfixed' .

The difference between these size functions and the original commands defined by LaTeX2e is that a CJK size function defines a class of fonts.

If you say as an example

\DeclareFontShape{U}{Bg5}{m}{song}{<6> <7> <8> sCJKfixed * b5so07}{} ,

LaTeX2e searches for fonts named b5so0701 - b5so0758 if the font size is 6, 7 or 8 pt; in other words, the CJK size functions append two digits to select the proper subfonts. These digits are defined in the \CJK@...Encoding macros; the macro \CJK@plane holds the current value (in pmC compatibility mode, \CJK@plane holds hexadecimal numbers).
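The naming rule just described (a font name stem plus a two-digit running number) is easy to spell out. A small Python sketch with a hypothetical function name; it covers only the decimal numbering of the example, not the hexadecimal numbers used in pmC compatibility mode.

```python
def subfont_names(stem: str, count: int):
    """List the subfont names obtained by appending 01, 02, ... to the stem."""
    return [f"{stem}{number:02d}" for number in range(1, count + 1)]


# The Big 5 example from the text: b5so0701 up to b5so0758.
names = subfont_names("b5so07", 58)
```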

See the example .fd files to find out how to define font substitutions as well.

Caveats
-------

o You can of course use CJK-environments inside of a CJK-environment, but it is possible that you must increase the so called save size (with emTeX you can adjust this with -ms=...).

The CJK package has options which control the scope of CJK environments:

lowercase If you want to use \lowercase with encodings inside CJK environments. You need less save size using the `encapsulated' option if `lowercase' is not set. You must use Bg5conv.tex to use Big 5 characters with this option.

global \lccode (if `lowercase' set), \uccode, \catcode and the activation of the characters 0xA1-0xFE will be globally modified (\lccode and \uccode reset to 0). This is the most economical mode concerning save size, but you can't have CJK environments inside of CJK environments or other environments which manipulate the character range 0xA1-0xFE.

Packages which change some of the above values only once (e.g. in the preamble) will also not work after the first use of a CJK environment.

local Only \lccode (if `lowercase' set) and \uccode will be modified globally. This is the default. You can stack environments.

encapsulated If you want to use DC fonts outside of the CJK environment with \uppercase and \lowercase working correctly, you must use this option. All values mentioned above will be local, so you can stack environments. This option probably causes an overflow of the save size.

Say

\usepackage[option]{CJK}

to activate `option'.

o There is another way to overcome the problem of stacked environments. CJK implements two low level CJK attribute switches: \CJKenc and \CJKshape, which take the same arguments as the corresponding parameters of the CJK environment. If you need two different encodings/shapes on the same output line, you must use these macros. An example:

\begin{CJK}{GBs}{song}
... Text in GBs song ...
\CJKenc{GBt}
... Text in GBt song ...
\CJKshape{kai}
... Text in GBt kai ...
\end{CJK}

Contrary to \begin{CJK}{...}{...} it's not necessary to start a new line after \CJKenc.

o The characters \, {, and } are used as second bytes in the Big 5 encoding. If you write Big 5 text mixed with other encodings, you should use the Bg5text environment which changes the category codes of these characters. The command prefix is now the forward slash `/', and the grouping characters are `(' and `)' respectively.

An example:

\begin{CJK}{Bg5}{song}
\begin{Bg5text}
....
/begin(center)
....
/end(center)
....
/end(Bg5text)
\end{CJK}

To get the `/', `(', and `)' characters, write `//', `/(', and `/)' inside the Bg5text environment.

This environment is ugly, and some commands like \newcommand will not work in it.

o Instead of using the Bg5text environment you can protect the offending second bytes with a backslash, i.e. \{, \}, \\ (using a non-Chinese editor). This will not increase the readability of the Chinese text, but for short texts it's perhaps more comfortable. Alas, it doesn't work in page header commands because the macros \{ etc. will not be expanded.

o Be careful not to use any commands inside the Bg5text environment which write something into an external file (commands like \chapter etc.).

o If it's not possible to avoid Big 5 character codes with \, {, or } outside of the Bg5text environment (e.g. having Big 5-text in a \chapter or \section command), you can replace them with the \CJKchar macro manually:

\section{This is a problematic Big 5 character: \CJKchar{169}{92}}

The parameters are the first and second byte of the Big 5 character code. You can also use hexadecimal or octal notation.
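Finding the offending characters by hand is tedious, so the manual replacement can be sketched as a small filter. This Python sketch is only an illustration: the function name is hypothetical, it assumes Big 5 first bytes in the range 0xA1-0xFE, and it uses the decimal notation of the example above.

```python
# Second bytes that collide with TeX's special characters \, {, and }.
SPECIAL_SECOND_BYTES = {0x5C, 0x7B, 0x7D}


def protect_big5(data: bytes) -> bytes:
    """Replace Big 5 characters whose second byte is \\, {, or } by \\CJKchar{..}{..}."""
    out = bytearray()
    i = 0
    while i < len(data):
        first = data[i]
        if 0xA1 <= first <= 0xFE and i + 1 < len(data):
            second = data[i + 1]
            if second in SPECIAL_SECOND_BYTES:
                # Write both bytes as decimal parameters of \CJKchar.
                out += ("\\CJKchar{%d}{%d}" % (first, second)).encode("ascii")
            else:
                out += data[i:i + 2]
            i += 2
        else:
            # ASCII text, including real backslashes and braces, is copied as is.
            out.append(first)
            i += 1
    return bytes(out)
```

The character with the bytes 0xA9 0x5C from the \section example becomes \CJKchar{169}{92}, while LaTeX commands written in ASCII pass through unchanged.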

o A similar command is \Unicode{byte1}{byte2} to access Unicode characters (not in UTF 8) directly; the parameters are the first and second byte of the Unicode.

o CJK will disable \uppercase (preserving the command as \CJKuppercase) if you select Big 5 encoding without using Bg5conv.tex . This affects the headers of the standard classes and \Roman only in standard LaTeX2e. Be aware that some packages and style files may use \uppercase for dirty tricks (e.g. to define macros for active characters).

o \uppercase and \lowercase will work with NONE of the CJK encoding schemes if you use DC fonts because these 8-bit fonts have most \lccode's and \uccode's set in the range 0x80-0xFF.

o You should define for each TeX font size a CJK font (as an example, use sCJKfixed for good sizes and CJKfixed for bad sizes, and LaTeX2e will complain loudly about wrong sizes on the screen).

LaTeX2e will also do the job if some size definitions are missing (using defined sizes), but expect a font warning for each (!) CJK character affected under certain circumstances.

Possible errors
---------------

o If you write Chinese (or Japanese) text, don't forget to suppress the linefeed character with a trailing `%' in the CJK environment, otherwise you get unwanted spaces in the output. On the other hand, say `\ ' or something similar inside the CJK* environment to get a space after a CJK character.

o To prevent a line break before a CJK character (e.g. between an opening (non-CJK) parenthesis and a CJK character), say \CJKkern. This command prevents the insertion of \CJKglue before the CJK character.

You may wonder about the curious name: a small kern (1 sp) between two CJK characters signals that the first one is a punctuation character.

o If you get the error message: "\CJK@min (or \CJK@max) undefined", you should insert \newpage before saying \end{CJK}. This can happen if LaTeX writes the headers (or footers) of a page containing CJK characters after closing the CJK environment.

o If you get overfull hboxes caused by CJK characters, try to increase \CJKglue. It defines the glue between CJK characters; the default definition is

\newcommand{\CJKglue}{\hskip 0pt plus 0.08\baselineskip} .

\CJKglue will be inserted by CJK before each Chinese character (except punctuation characters as defined in the punctuation tables; see CJK.enc), and none after. You should separate non-Chinese text from CJK characters with spaces to enable hyphenation.

o If you get overfull hboxes caused by Hangul syllables, try to increase \CJKtolerance. The default definition is

\newcommand{\CJKtolerance}{400} .

o If you encounter a TeX stack overflow caused by {\CJKenc{new_encoding} ....}, you should write

\CJKenc{new_encoding} ... \CJKenc{old_encoding}

instead. Or (better) increase the stack size as discussed above.

How to get CJK and related software
-----------------------------------

o You will find CJK and software related to TeX at the CTAN hosts (Comprehensive TeX Archive Network). These completely identical ftp servers (concerning TeX software) are

ftp.shsu.edu   Sam Houston University                         Texas (USA)
ftp.dante.de   DANTE (Deutsche Anwendervereinigung fuer TeX)  Heidelberg (Germany)
ftp.tex.ac.uk  Cambridge University                           Cambridge (England)

You should use the nearest one, or even better, a local mirror of a CTAN host.

CJK will be found unpacked. To receive the complete package, go to the parent directory of CJK and say

get CJK.zip

or (whichever is appropriate for your system)

get CJK.tar.gz

The CJK directory and all subdirectories will be sent to you in compressed form. Be aware that not all mirrors of CTAN sites support compression of directories.

o The main site for Chinese related software is ifcss.org (USA). Mirrors are ftp.edu.tw (Taiwan), cnd.org (USA) and kth.se (Sweden). Here you will find free Chinese fonts, text editors etc.

Note that at the time of updating this text (3-Jan-1994), ftp access to ifcss.org was still suspended due to networking problems.

o The main site for Korean related software is cair-archive.kaist.ac.kr (Korea). I don't know of any mirror sites of this host. At ifcss.org you will find a 24x24 hanja font with HBF header in /software/fonts/misc/hbf.

o Sam Chiu <ccc11@cus.cam.ac.uk> compiled the fonts jfs56 (GBs encoded) and ntu_kai48 (Big 5 encoded) for various sizes with 600dpi resolution. You will find them (about 22 MByte uncompressed!) at the CTAN hosts in /tex-archive/fonts/chinese

Author
------

Werner Lemberg <a7621gac@awiuni11.bitnet>

Goldschlagstr. 52/14
A-1150 Vienna
Austria/Europe

Please report any errors or suggestions to this email-address.


