RE: Unicode Support in Adobe Acrobat

From: Rick Cameron (Rick.Cameron@businessobjects.com)
Date: Mon May 02 2005 - 14:52:28 CDT

    Hi, Eric

    Thanks for the explanation. As you say, it's unfortunate that the PDF
    spec uses misleading terminology for these concepts.

    I seem to recall that the fonts used in a PDF file are restricted in the
    number of glyphs they can have. IIRC the limit is 256. Thus, when our
    app produces PDF files it has to split a large font into several derived
    fonts, and make mappings from Unicode code points to glyph indices in
    these derived fonts.
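
    A minimal sketch of that splitting step, assuming the 256-code limit
    recalled above and one-byte codes within each derived font. The function
    names and the first-come slot assignment are illustrative only, not a
    description of how any particular application does it.

```python
def split_into_subfonts(text, limit=256):
    """Give each distinct code point a (derived font index, one-byte code) slot.

    Slots are handed out in order of first appearance (an assumption made
    for this sketch); slot n lands in derived font n // limit at code n % limit.
    """
    slots = {}
    for ch in text:
        cp = ord(ch)
        if cp not in slots:
            slots[cp] = divmod(len(slots), limit)
    return slots


def encode_runs(text, slots):
    """Break text into runs that share a derived font: [(font index, bytes), ...]."""
    runs = []
    for ch in text:
        font, code = slots[ord(ch)]
        if runs and runs[-1][0] == font:
            runs[-1][1].append(code)      # same derived font: extend the run
        else:
            runs.append((font, bytearray([code])))  # font switch: start a new run
    return [(font, bytes(codes)) for font, codes in runs]
```

    Each run would then be shown with its own derived font selected, which
    is why a single line of text can end up as several show operations.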

    It would be far more convenient if it were possible to use Unicode code
    points as glyph indices.

    Has this situation changed? If not, is it likely to change in the
    future?

    Thanks

    - rick

    -----Original Message-----
    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    Behalf Of Eric Muller
    Sent: May 2, 2005 11:57
    Cc: Unicode List
    Subject: Re: Unicode Support in Adobe CS2

    Rick Cameron wrote:

    >Does 'most of our applications' include Acrobat? The last time I looked
    >at the PDF file format (which is a couple of years ago) it did not
    >allow text to be represented as Unicode.

    The thing to understand is that fundamentally, a PDF content stream (the
    name of the part that describes the content of a page) describes which
    glyph of which font is positioned where on a page. When you see in a PDF
    document "(office) Tj", it really means "display the glyph with glyph id
    0x6F of the current font at the current point, and advance the current
    point by the width of that glyph; display the glyph with glyph id 0x66
    at the current point, ..."
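
    The reading of "(office) Tj" described above can be sketched as
    follows. This assumes one-byte codes and ignores string escapes, font
    size and spacing parameters; the widths table is made up for the
    example and does not come from any real font.

```python
def show_text(codes: bytes, widths: dict, x: int = 0):
    """Place one glyph per byte, advancing the current point by each glyph's width.

    Returns the list of (glyph id, x position) placements and the final x.
    """
    placed = []
    for code in codes:
        placed.append((code, x))  # display this glyph at the current point
        x += widths[code]         # then advance by the width of that glyph
    return placed, x


# Hypothetical advance widths keyed by glyph id (illustrative values only).
WIDTHS = {0x6F: 500, 0x66: 300, 0x69: 250, 0x63: 450, 0x65: 450}

placed, end_x = show_text(b"office", WIDTHS)
```

    Nothing in this loop knows or cares that 0x6F happens to look like "o";
    it only looks up glyphs and widths by number.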

    It so happens that in the most common cases, the glyph with glyph id
    0x6F renders as "o", etc; it also happens that the PDF spec calls these
    glyph ids "character codes"; it also happens that the PDF spec calls the
    byte sequence of the glyph ids a "string". Hence, it is easy to be
    misled and believe that "(office) Tj" means "render the (Unicode)
    character string 'office' at the current point." But that is not what
    PDF content streams are about. In particular, there is no opportunity
    for a PDF renderer to use an "ffi" ligature.

    The choice of capturing the glyphs, i.e. the result of layout, rather
    than the characters, i.e. the input to layout, is what makes PDF so good
    at providing fidelity (and is arguably necessary to achieve that
    fidelity).

    Besides the content stream, PDF also allows the input to layout to be
    captured. This is what the /ToUnicode entry in PDF /Font objects, the
    /AltText entry on marked content and the whole "tagged PDF" stuff is
    about. Furthermore, this input is correlated with the glyph references,
    i.e. it is possible to record that a given occurrence of the glyph with
    glyph id 0x6F of some font does render the (Unicode) character "U+006F",
    or even that a sequence of glyph occurrences does render a given
    (Unicode) character string. In many common cases the representation of
    that correlation is very efficient.
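
    As a rough illustration of that correlation, the sketch below uses a
    plain dictionary as a stand-in for a font's /ToUnicode mapping (in a
    real file this is a CMap stream, not a dictionary). The glyph ids and
    the font are hypothetical; id 0x02 is assumed to be an "ffi" ligature
    glyph.

```python
# Hypothetical mapping from glyph id to the Unicode text that glyph renders.
TO_UNICODE = {
    0x01: "o",
    0x02: "ffi",  # one ligature glyph standing for three characters
    0x03: "c",
    0x04: "e",
}


def extract_text(codes: bytes, to_unicode: dict) -> str:
    """Recover the character string behind a run of glyph ids.

    Glyph ids with no entry come back as U+FFFD, since the content stream
    alone does not say which characters they stand for.
    """
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


text = extract_text(bytes([0x01, 0x02, 0x03, 0x04]), TO_UNICODE)
```

    Here four glyphs yield the six-character string "office", even though
    none of the bytes in the content stream is the code point of any of
    those characters.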

    So the statement about the PDF format is: Whenever *characters* are
    represented in PDFs, they can (and sometimes have to) be represented
    using Unicode.

    Whether a specific PDF generator does properly record the input to
    layout along with the content stream (and their correlation), whether
    it is even in a position to do so, is a separate issue.

    Eric.


