Re: Acrobat, Unicode, Advanced usage

From: Eric Muller (emuller@adobe.com)
Date: Tue Jul 09 2002 - 15:04:29 EDT


Greenwood, Timothy wrote:

>This question is pertinent to one asked me the other day for which I did not have an answer. Is the code set of an original document relevant for PDF - say EUC, SJIS, PDF - will the output perform text searches correctly for differing code set inputs?
>
PDF documents logically contain two streams: one of characters, and one
of glyphs.

The glyph stream is always physically present, and is used for
rendering. Depending on the fonts involved, the PDF generator, and
other factors, the meaning of the numbers in that glyph stream and
the machinery to locate the actual outlines vary quite a bit.

The character stream can be represented explicitly, in which case I am
pretty sure it is always a Unicode stream. Alternatively, it can be
computed from the glyph stream using various mechanisms; I believe that
all the computations described in the PDF spec generate a Unicode stream.
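
To make the implicit case concrete, here is a minimal sketch of computing
a character stream from a glyph stream through a lookup table. It is
similar in spirit to a ToUnicode mapping in PDF, but the glyph numbers,
the table, and the function name are all made up for illustration; this
is not real PDF syntax or any library's API.

```python
# Hypothetical glyph-code -> Unicode table (illustrative values only).
# One glyph may stand for several characters, as with the "fi" ligature.
GLYPH_TO_UNICODE = {
    17: "o",
    42: "f",
    43: "fi",  # single ligature glyph, two characters
    58: "c",
    61: "e",
}


def characters_from_glyphs(glyph_codes, table=GLYPH_TO_UNICODE):
    """Compute a Unicode string from a sequence of glyph codes.

    Glyph codes with no entry map to U+FFFD (replacement character),
    which is the kind of approximation a consumer ends up with when
    the mapping is incomplete.
    """
    return "".join(table.get(code, "\ufffd") for code in glyph_codes)


glyphs = [17, 42, 43, 58, 61]  # rendered as: o, f, fi-ligature, c, e
text = characters_from_glyphs(glyphs)
print(text)
print("office" in text)
```

Searching then happens on the computed Unicode string, not on the glyph
numbers, which is why the ligature glyph does not get in the way.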

The choice of explicit vs implicit character representation is up to the
PDF producer. In all cases, I believe that the producer has the
responsibility of converting from whatever character standard is used in
the original document to Unicode. When the producer is Distiller, it may
not have access to the original character content and may be forced to
create an approximation.
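
As a rough illustration of that producer-side conversion, the sketch
below takes the same Japanese text in EUC-JP and in Shift_JIS and
converts both to Unicode using Python's standard codecs. The sample
string and the helper name are assumptions for the example; a real
producer would do this inside its own text pipeline.

```python
# The same text as it might arrive from two differently encoded sources.
original = "日本語のテキスト"
euc_bytes = original.encode("euc_jp")
sjis_bytes = original.encode("shift_jis")


def to_unicode(raw, source_encoding):
    """Convert bytes in the source document's code set to a Unicode string."""
    return raw.decode(source_encoding)


from_euc = to_unicode(euc_bytes, "euc_jp")
from_sjis = to_unicode(sjis_bytes, "shift_jis")

# The byte sequences differ, but the Unicode result is identical, so a
# search over the character stream behaves the same for either input.
print(euc_bytes != sjis_bytes)
print(from_euc == from_sjis)
print("テキスト" in from_euc and "テキスト" in from_sjis)
```

Once the conversion has been done correctly, the code set of the
original document no longer matters to text search.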

Eric.


