Re: minimizing size (was Re: allocation of Georgian letters)

From: Eric Muller
Date: Sat Feb 09 2008 - 19:46:17 CST

    James Kass wrote:
    > PDF has long been touted as *the* way to safely send text with the
    > assurance that the recipients will be able to display that text exactly
    > as the author intended.

    Actually, what PDF promises that for is "final form documents", not text.

    > Without any real knowledge of the PDF format and what happens when
    > converting a file to PDF, it appears to me that it is not text which is
    > being embedded. Rather, the process is embedding glyphs.

    Glyphs are the primary construct needed for "final form
    documents". Glyphs are mandatory in PDFs.

    When you see something like "(the car) Tj" in a PDF content stream, the
    "the car" piece only accidentally looks like text (an intended
    accident, of course, but an accident nevertheless).
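
    To make that point concrete, here is a rough Python sketch (not taken
    from any PDF library). It assumes a simple font with one-byte
    character codes, and the function name glyphs_for_tj and both encoding
    tables are made up for illustration. The operand of Tj is a string of
    codes that the font's encoding turns into glyphs, so the same glyphs
    can be painted from bytes that look nothing like text.

```python
from typing import Dict, List


def glyphs_for_tj(operand: bytes, encoding: Dict[int, str]) -> List[str]:
    """Return the glyph names a Tj-like operator would paint.

    Assumes one byte per character code (a simple font). Codes missing
    from the encoding fall back to the ".notdef" glyph.
    """
    return [encoding.get(code, ".notdef") for code in operand]


# Glyph names needed to paint "the car".
GLYPHS = ["t", "h", "e", "space", "c", "a", "r"]

# Hypothetical encoding 1: codes happen to coincide with ASCII,
# so the operand looks like text.
ascii_like = {(ord(" ") if g == "space" else ord(g)): g for g in GLYPHS}

# Hypothetical encoding 2: a subset font whose codes were assigned
# in order of first use (1, 2, 3, ...).
subset = {i + 1: g for i, g in enumerate(GLYPHS)}

readable = glyphs_for_tj(b"the car", ascii_like)
opaque = glyphs_for_tj(bytes(range(1, 8)), subset)

# Same glyphs on the page, very different bytes in the content stream.
print(readable == opaque)
```

    With the second table the content stream would carry the bytes 01
    through 07, and nothing in it would resemble "the car".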
    > If a glyph
    > is mapped to a Unicode value, at least some applications can return that
    > value. But, if the glyph is not mapped to a unicode value (which is
    > normally the case with presentation forms used in complex scripts),
    > there does not seem to be any effort made to preserve the Unicode
    > string which generated the presentation form. And that's really a
    > shame.

    Actually, there are ways to include characters in addition to the
    glyphs, even when the character/glyph correspondence is not one-for-one
    (look for /ActualText in the PDF reference; /ToUnicode maps are
    conceptually optimizations of that), but whether those ways are
    exploited depends on the PDF generator. Some generators use nothing,
    others will generate only /ToUnicode (what you describe), which can
    account for only 1-to-1 character/glyph mappings, and others will use
    the full apparatus.
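
    As a rough illustration of those three cases, the Python sketch below
    models a text extractor. Everything in it is an assumption made for
    the example: GlyphRun, extract_text, and the toy code assignments do
    not come from any real PDF library, and the to_unicode dictionary only
    stands in for a /ToUnicode map as a per-code lookup. The sample is the
    Devanagari syllable "ki", whose vowel-sign glyph is drawn before the
    consonant glyph although the character comes after it in the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GlyphRun:
    """A run of glyph codes, optionally tagged with replacement text.

    `actual_text` stands in for an /ActualText entry attached to the run.
    """
    codes: bytes
    actual_text: Optional[str] = None


def extract_text(runs: List[GlyphRun],
                 to_unicode: Optional[Dict[int, str]]) -> str:
    """Recover text, preferring actual text over the per-code lookup."""
    out = []
    for run in runs:
        if run.actual_text is not None:
            out.append(run.actual_text)
        elif to_unicode is not None:
            out.append("".join(to_unicode.get(c, "\ufffd") for c in run.codes))
        # With neither mechanism, nothing can be recovered for this run.
    return "".join(out)


KA, VOWEL_SIGN_I = "\u0915", "\u093f"
intended = KA + VOWEL_SIGN_I            # logical (character) order

# Toy font: code 1 paints the vowel-sign glyph, code 2 paints KA.
to_unicode = {1: VOWEL_SIGN_I, 2: KA}
visual_codes = bytes([1, 2])            # glyph (visual) order

nothing = extract_text([GlyphRun(visual_codes)], None)
lookup_only = extract_text([GlyphRun(visual_codes)], to_unicode)
full = extract_text([GlyphRun(visual_codes, actual_text=intended)], to_unicode)

print(nothing == "", lookup_only == intended, full == intended)
```

    In this toy model the per-code lookup returns the right characters in
    glyph order, which is the wrong order for the text, while the run
    tagged with actual text round-trips exactly.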

    For example, if you take the PDFs generated for the UDHR in Unicode
    project (e.g. for a
    small comprehensive example), then except for the space problem
    mentioned earlier, I think that you can copy from Acrobat and paste in
    Notepad and get back all the text.

