Re: Unicode Support in Adobe CS2

From: Eric Muller (
Date: Mon May 02 2005 - 13:56:48 CDT

    Rick Cameron wrote:

    >Does 'most of our applications' include Acrobat? The last time I looked
    >at the PDF file format (which is a couple of years ago) it did not allow
    >text to be represented as Unicode.

    The thing to understand is that fundamentally, a PDF content stream (the
    name of the part that describes the content of a page) describes which
    glyph of which font is positioned where on a page. When you see in a PDF
    document "(office) Tj", it really means "display the glyph with glyph id
    0x6F of the current font at the current point, and advance the current
    point by the width of that glyph; display the glyph with glyph id 0x66
    at the current point, ..."
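
    As a rough illustration of that reading of "(office) Tj", here is a
    minimal Python sketch. It assumes a simple font with one-byte codes;
    the WIDTHS table and the show_string helper are invented for this
    example (the widths are made-up numbers, not taken from any real
    font), and character spacing, word spacing and horizontal scaling
    are ignored.

```python
# Hypothetical advance widths, in thousandths of a text-space unit.
# These numbers are made up for illustration; a real font supplies its own.
WIDTHS = {0x6F: 500, 0x66: 333, 0x69: 278, 0x63: 444, 0x65: 444}


def show_string(codes: bytes, x: float, y: float, font_size: float):
    """Sketch of what Tj does: paint one glyph per code, then advance.

    Returns the list of (code, x, y) placements and the final x position.
    Nothing here looks at characters; the bytes are only glyph selectors.
    """
    placements = []
    for code in codes:
        placements.append((code, x, y))          # paint glyph at current point
        x += WIDTHS[code] / 1000 * font_size     # advance by that glyph's width
    return placements, x


placements, end_x = show_string(b"office", 72.0, 700.0, 10.0)
for code, gx, gy in placements:
    print(f"glyph id 0x{code:02X} at ({gx:.2f}, {gy:.2f})")
```

    The loop never consults Unicode: the byte 0x6F is used only to pick
    a glyph and to look up how far to advance.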

    It so happens that in the most common cases, the glyph with glyph id
    0x6F renders as "o", etc; it also happens that the PDF spec calls these
    glyph ids "character codes"; it also happens that the PDF spec calls the
    byte sequence of the glyph ids a "string". Hence, it is easy to be
    misled and believe that "(office) Tj" means "render the (Unicode)
    character string 'office' at the current point." But that is not what
    PDF content streams are about. In particular, there is no opportunity
    for a PDF renderer to substitute an "ffi" ligature on its own: the
    glyphs were already chosen when the content stream was produced.

    The choice of capturing the glyphs, i.e. the result of layout, rather
    than the characters, i.e. the input to layout, is what makes PDF so good
    at providing fidelity (and is arguably necessary to achieve that fidelity).

    Besides the content stream, PDF also allows the input to layout to be
    captured. This is what the /ToUnicode entry in PDF /Font objects, the
    /AltText entry on marked content and the whole "tagged PDF" stuff is
    about. Furthermore, this input is correlated with the glyph references,
    i.e. it is possible to record that a given occurrence of the glyph with
    glyph id 0x6F of some font does render the (Unicode) character "U+006F".
    Or even that a sequence of glyph occurrences does render a given
    (Unicode) character string. In many common cases the representation of
    that correlation is very efficient.
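
    To make the correlation concrete, here is a small Python sketch of
    the idea behind a /ToUnicode mapping. The TO_UNICODE dictionary and
    the extract_text helper are stand-ins invented for this example; an
    actual /ToUnicode entry is a CMap stream with its own syntax, and
    the code 0x01 for an "ffi" ligature glyph is purely hypothetical.

```python
# Hypothetical code -> Unicode text table, in the spirit of /ToUnicode.
# Code 0x01 is an invented slot holding an "ffi" ligature glyph.
TO_UNICODE = {
    0x6F: "o",
    0x01: "ffi",   # one glyph, three characters
    0x63: "c",
    0x65: "e",
}


def extract_text(codes: bytes) -> str:
    """Recover the characters behind a sequence of glyph codes.

    Codes with no entry become U+FFFD, since the glyph alone does not
    say which characters it stood for.
    """
    return "".join(TO_UNICODE.get(code, "\uFFFD") for code in codes)


shown = b"o\x01ce"           # four glyphs as they might appear before Tj
text = extract_text(shown)   # the six characters that were the layout input
print(len(shown), "glyphs ->", repr(text))
```

    With a table like this, the layout engine is free to pick the
    ligature glyph while the character string stays recoverable for
    search and copy.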

    So the statement about the PDF format is: Whenever *characters* are
    represented in PDFs, they can (and sometimes have to) be represented
    using Unicode.

    Whether a specific PDF generator does properly record the input to
    layout along with the content stream (and their correlation), whether
    it is even in a position to do so, is a separate issue.


    This archive was generated by hypermail 2.1.5 : Mon May 02 2005 - 13:57:40 CDT