Re: PostScript and Unicode

From: Jim DeLaHunt (delahunt@adobe.com)
Date: Fri Dec 18 1998 - 15:06:17 EST


Kevin:

I'd like to respond to your "two queries regarding PostScript". Sorry I'm a
bit delayed in doing so. It turns out to be a long answer. Perhaps even
more than you were asking for!

At 05:14 AM 12/7/98 -0800, Kevin Bracey wrote:
>Two queries regarding PostScript:
>
>1) Could anyone working at/with Adobe enlighten me on the position of
>PostScript with regard to non-BMP UCS characters? Looking at the Unicode and
>Glyph Names document at
>
> http://www.adobe.com/supportservice/devrelations/typeforum/unicodegn.html
>
>there is a firm 16-bit assumption: "g is of the form uni<CODE> (where <CODE>
>is a 4-digit uppercase hexadecimal number)".

First, I'm happy to say that this document has been updated; check out the
URL above for the new version.

Second, before you go any further, let's stop and place this issue with
respect to the well-known character-glyph rendering model. PostScript is
in the realm of final-form documents; it deals in glyph images and
presentation forms, not characters or coded characters in the Unicode sense.

Thus the most basic position of PostScript is: render your characters into
glyphs. Use some appropriate names or numbers to refer to glyph images, and
build fonts based on these names/numbers and glyph images. Send both fonts
and a page description that uses these fonts to a PostScript interpreter.
Page appears. You can make nearly arbitrary choices of glyph images, and
names or numbers to refer to them.

Now, if you don't want to create your own glyph images and fonts, but reuse
others, then you are asking about how Adobe and others have implemented
PostScript fonts and what names or numbers we use. You then also need to
specify which character-glyph rendering process you are referring to. In
particular, is your app doing the character-glyph rendering, or are you
relying on some system component to do it? Or, do you expect to provide
character codes to the PostScript environment, and have *it* do the
character-glyph mapping?

Returning to your question,
>...Looking at the Unicode and Glyph Names document ...
>there is a firm 16-bit assumption: "g is of the form uni<CODE> (where
><CODE> is a 4-digit uppercase hexadecimal number)".

Using the vocabulary common in the Unicode community, we generate names for
"presentation forms" by using the <CODE> of the corresponding Unicode "coded
character". Of course this model bends a little when the mapping between
coded character and presentation form is not 1-1.

The revised document gives the straightforward answer: add names of the
form "uni<CODE1><CODE2>".
<http://www.adobe.com/supportservice/devrelations/typeforum/unicodegn.html#2.c.i>
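
As a rough illustration of the naming scheme, here is a minimal Python
sketch. It rests on an assumption this message does not spell out: that for
a character outside the BMP, <CODE1> and <CODE2> are the UTF-16 high and low
surrogate values. The function name uni_glyph_name is hypothetical. Check
the revised document before relying on this.

```python
def uni_glyph_name(code_point: int) -> str:
    """Build a 'uni' glyph name for one Unicode code point.

    Assumption (not stated in the message): for a code point outside the
    BMP, <CODE1> and <CODE2> are its UTF-16 high and low surrogates.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the range UTF-16 can express")
    if code_point <= 0xFFFF:
        # BMP: a single 4-digit uppercase hexadecimal code.
        return "uni%04X" % code_point
    # Non-BMP: split into a surrogate pair, then concatenate both codes.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return "uni%04X%04X" % (high, low)


print(uni_glyph_name(0x4E00))   # a BMP ideograph
print(uni_glyph_name(0x20000))  # a code point outside the BMP
```

Note that under this assumption the scheme reaches only as far as UTF-16
does (U+10FFFF), not the full 31-bit UCS space.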

>Our system uses a proprietary font format, but uses encoding files
>containing PostScript glyph names. The system is fully capable of handling
>the 31-bit UCS space, and as such will require a way of specifying
>characters outside the BMP as a PostScript glyph name.
>
>Are there plans to extend this specification to cover non-BMP characters?

Maybe the straightforward answer above answers your question. But given
your second question, maybe the following will be helpful.

Any PostScript font encoding a Far East language such as Japanese or
Chinese will probably use an OCF font or a CID-Keyed font, where this
document does not apply. Let me review how OCF and CID font formats relate
to this question. I apologize if you know all this already. It may be new
to some other readers.

OCF fonts are an old format for Japanese fonts which Adobe no longer
supports, but OCF fonts are still found in the CJK market. OCF fonts are
tightly bound to PC industry character sets, and you refer to a glyph by
its character code.

In a CID-Keyed font, you refer to characters ("presentation forms" in the
Unicode sense) by an integer. The semantics of these integers are defined
by a "Registry", "Ordering" and "Supplement" -- jointly they define a
mapping from integers to presentation forms. Adobe has defined several
Registry-Ordering-Supplement mappings for presentation forms corresponding
to the BMP, e.g. Adobe-Japan1-2, which lists the presentation forms needed
to print common Japanese PC industry character sets.

A related part of the CID-Keyed font architecture is the mapping from
character codes to CID integers. This mapping is implemented by a file
known as a "CMap" (not to be confused with the OpenType or TrueType concept
of a "cmap"). Adobe has defined CMap files to map from common character
sets to Adobe-defined Registry, Ordering, and Supplement definitions. For
instance, the CMap "RKSJ-H" maps from the Shift JIS character code to the
Adobe-Japan1-2 presentation form codes.
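
To make the idea concrete, here is a toy Python model of what a CMap does:
it maps ranges of character codes onto runs of consecutive CID integers.
This is only a sketch of the concept, not the CMap file format. The table
contents, TOY_CMAP and code_to_cid are hypothetical, and the CID numbers are
made up rather than taken from any Adobe Registry-Ordering-Supplement.

```python
from bisect import bisect_right

# Each entry is (first_code, last_code, first_cid), sorted by first_code.
# All numbers here are invented for illustration only.
TOY_CMAP = [
    (0x0020, 0x007E, 1),
    (0x8140, 0x817E, 1000),
    (0x889F, 0x88FC, 2000),
]

# CID 0 is used here as the fallback for codes the table does not cover.
NOTDEF_CID = 0


def code_to_cid(code, ranges=TOY_CMAP):
    """Map one character code to a CID using the range table."""
    starts = [first for first, _, _ in ranges]
    # Find the last range whose first_code is <= code.
    i = bisect_right(starts, code) - 1
    if i >= 0:
        first, last, first_cid = ranges[i]
        if code <= last:
            # Codes within a range map to consecutive CIDs.
            return first_cid + (code - first)
    return NOTDEF_CID


for code in (0x0041, 0x8141, 0x7FFF):
    print(hex(code), "->", code_to_cid(code))
```

The point of the separation is that the same glyph collection can be
reached from several character sets simply by swapping the table.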

This is a deep topic, and this summary doesn't do it justice. For more
information on CID-Keyed fonts and CMap files, please see:
  <http://www.adobe.com/supportservice/devrelations/typeforum/cidfonts.html>

>2) When it comes to actually printing, what is the best source of
>information on how to send Unicode text to a PostScript printer? Is this
>actually possible yet? Can I specify an encoding vector in terms of
>/uni4E00 etc on a standard Japanese printer?

Firstly, the PostScript page description you send to a printer is a stream
of octets. It contains operators expressed as ASCII-encoded characters or
PostScript-specific binary tokens. It is highly unlikely that there will
ever be a PostScript printer that accepts a page description consisting of
UTF-16 codes.

However, that octet stream can certainly accept strings of glyph codes in
just about any desired encoding -- including Unicode. Adobe has developed
CMap files which map from Unicode to the Adobe-defined
Registry-Ordering-Supplement specifications.

However, remember that PostScript page descriptions are in the realm of
glyph images and presentation forms, not coded characters. By using *just*
Unicode code points in PostScript page descriptions, you are restricting
yourself to a very limited character-glyph rendering process. Consider:
 * It should be the application, not the PostScript interpreter, which
   performs font selection. Font selection is an important part of the
   character-glyph rendering process, affecting line breaks and which
   presentation forms appear. If the application is selecting fonts, can't
   it also influence the generation of glyph codes?
 * The application should also provide a way to select alternate glyph
   images where a font provides multiple versions. For instance, a font may
   have an "fi" ligature, which can be used in place of the "f" and "i"
   glyph images. Or, you may want to use the traditional form of a Japanese
   glyph instead of the simplified form. Adobe's Ken Lunde made a
   presentation about this at the 13th International Unicode Conference; his
   slides are at:
     <http://www.unicode.org/unicode/iuc/iuc13/a10/slides.pdf>

If I simplify your question, the answer becomes a little more specific and
perhaps useful.

>2) When it comes to actually printing, what is the best source of
>information on how to send Unicode text to a PostScript printer? ...

The Adobe Developers Association and our Type Forum
  <http://www.adobe.com/supportservice/devrelations/>
  <http://www.adobe.com/supportservice/devrelations/typeforum/main.html>

>... Is (sending Unicode text to a PostScript printer) actually possible yet?

No, not in a way that lives up to the challenge of the Unicode
character-glyph rendering model. Nor should it be. The PostScript
interpreter is the wrong place to do this rendering.

However, if your application performs character-glyph rendering and you
want to emit font selections and glyph codes for your own
fonts, I'd recommend using the CID font architecture. Define a Registry,
Ordering and Supplement which defines your own presentation form numbers.
Make this scheme as similar to UTF-16 as you like. Build your fonts in CID
format according to this spec. Make an identity CMap that maps these numbers
1-1. When your app generates a PostScript language page description, have
it generate strings containing these glyph codes, and font references that
invoke your fonts and identity CMaps.
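
The last step might look something like the following Python sketch, which
turns a list of glyph codes into a PostScript hexadecimal string followed by
the show operator. It assumes two-byte glyph codes written high byte first,
and it assumes the page description has already selected a font that pairs
your CID font with your identity CMap. The function name ps_show_glyphs is
hypothetical.

```python
def ps_show_glyphs(glyph_codes):
    """Return a PostScript fragment that shows the given glyph codes.

    Assumptions: each glyph code fits in two bytes, and the current font
    was already set up to read two-byte codes through an identity CMap.
    """
    parts = []
    for code in glyph_codes:
        if not 0 <= code <= 0xFFFF:
            raise ValueError("glyph code does not fit in two bytes: %r" % code)
        # Four hex digits per code, high byte first.
        parts.append("%04X" % code)
    # <...> is PostScript's hexadecimal string syntax, so the page
    # description stays a plain ASCII octet stream.
    return "<%s> show" % "".join(parts)


print(ps_show_glyphs([0x4E00, 0x4E01]))
```

Using the hexadecimal string form keeps the output within ASCII, which
matches the point above that a page description is a stream of octets and
not UTF-16 text.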

>...Can I specify an encoding vector in terms of /uni4E00 etc on a
>standard Japanese printer?

If the font is in a form that uses glyph names, sure. Those forms are: Type 1
fonts, OpenType fonts with non-CID-keyed CFF data, and TrueType fonts and
OpenType fonts with TrueType data that contain 'post' tables with implicit
or explicit glyph names. But, there are thousands of glyphs where /uni4E00
came from. Wouldn't you be better off using a CID structure for your
proprietary fonts?

Alternatively, if you want to use existing CJK fonts, but your application
still performs character-glyph rendering, then still define your own glyph
code, but base it on the existing Adobe Registry-Ordering-Supplement specs,
and build your own CMap file to map from your glyph codes to the Adobe CID
codes.

I hope these comments are helpful.

>--
>Kevin Bracey, Senior Software Engineer
>Acorn Computers Ltd Tel: +44 (0) 1223 725228
>Acorn House, 645 Newmarket Road Fax: +44 (0) 1223 725328
>Cambridge, CB5 8PB, United Kingdom WWW: http://www.acorn.co.uk/

        --Jim DeLaHunt, Engineering Manager
         Adobe Type Library, Adobe Systems Incorporated
         M/S W-08, 345 Park Ave, San Jose, CA 95110-2702
          email: delahunt@adobe.com, tel: +1-408-536-2690
