Re: extracting code values from PDF?

From: Eric Muller (emuller@adobe.com)
Date: Fri Oct 27 2006 - 00:59:02 CST

    For the page content, a PDF document primarily records glyphs and their
    positions. It can also optionally record the corresponding characters,
    using some combination of a mapping from glyphs to characters and local
    overrides. You can look at
    <http://www.udhrinunicode.org/assemblies/first_article_subset.pdf> to
    see how that can be done for a variety of writing systems. (I am aware
    that a copy-paste from that document using Acrobat results in additional
    SPACE characters; this seems to be a problem with Acrobat.)
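
    As an illustration of the "mapping plus local overrides" idea, here is
    a minimal Python sketch. It does not use a real PDF library: the glyph
    IDs, the `glyph_to_chars` dictionary, the `overrides` spans and the
    function name `extract_text` are all hypothetical stand-ins. In an
    actual PDF the font-level mapping would typically be a ToUnicode CMap
    and the local override an /ActualText property; the sketch only shows
    how the two could combine.

```python
from typing import Dict, Iterable, List, Tuple

REPLACEMENT = "\ufffd"  # emitted when a glyph has no known character mapping


def extract_text(
    glyph_run: List[int],
    glyph_to_chars: Dict[int, str],
    overrides: Iterable[Tuple[int, int, str]] = (),
) -> str:
    """Recover characters from a run of glyph IDs (hypothetical data model).

    glyph_to_chars: font-level mapping; one glyph may map to several
                    characters (e.g. a ligature).
    overrides:      (start, end, text) half-open spans of glyph indices whose
                    text replaces whatever the font-level mapping would give.
    """
    by_start = {}
    for start, end, text in overrides:
        if not 0 <= start < end <= len(glyph_run):
            raise ValueError(f"bad override span: {start}..{end}")
        by_start[start] = (end, text)

    out = []
    i = 0
    while i < len(glyph_run):
        if i in by_start:
            # A local override wins over the per-glyph mapping.
            end, text = by_start[i]
            out.append(text)
            i = end
        else:
            out.append(glyph_to_chars.get(glyph_run[i], REPLACEMENT))
            i += 1
    return "".join(out)


# Latin: glyph 7 is an "fi" ligature, so the mapping alone is enough.
latin = extract_text([1, 2, 7, 3, 4], {1: "o", 2: "f", 7: "fi", 3: "c", 4: "e"})

# Devanagari: the vowel sign I is drawn before KA, so the glyph order differs
# from the character order and a local override supplies the logical text.
deva_map = {20: "\u093f", 21: "\u0915"}
visual_order = extract_text([20, 21], deva_map)
logical_order = extract_text([20, 21], deva_map, [(0, 2, "\u0915\u093f")])
```

    The first call yields "office" from five glyphs. The last two calls
    show why a per-glyph mapping is not always sufficient: without the
    override the characters come out in glyph (visual) order.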

    What a PDF consumer does with that is another story. Acrobat uses the
    character data when present; when it is absent, Acrobat attempts, with
    more or less success, to infer the characters from whatever else is
    available in the PDF. This currently works relatively well for Latin
    text, but relatively little work has been done for other writing systems.
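
    To make "whatever is available" concrete: one piece of information a
    font may carry is glyph names. The Python sketch below guesses
    characters from names that follow the Adobe Glyph List naming
    conventions ("uniXXXX", "uXXXXXX", "_" between ligature components, a
    "." suffix for variants). This is an assumption about what such a
    fallback could look like, not a description of what Acrobat actually
    does; `KNOWN_NAMES` is a tiny illustrative subset and
    `chars_from_glyph_name` is a hypothetical helper.

```python
import re
from typing import Optional

# Tiny illustrative subset of a glyph-name table; a real one is far larger.
KNOWN_NAMES = {"space": " ", "a": "a", "f": "f", "i": "i", "eacute": "\u00e9"}

_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)")  # uni0915, uni00660069
_U = re.compile(r"u([0-9A-F]{4,6})")         # u1F600


def _component_to_chars(part: str) -> Optional[str]:
    if part in KNOWN_NAMES:
        return KNOWN_NAMES[part]
    m = _UNI.fullmatch(part)
    if m:
        digits = m.group(1)
        return "".join(
            chr(int(digits[i:i + 4], 16)) for i in range(0, len(digits), 4)
        )
    m = _U.fullmatch(part)
    if m:
        cp = int(m.group(1), 16)
        if cp <= 0x10FFFF:
            return chr(cp)
    return None


def chars_from_glyph_name(name: str) -> Optional[str]:
    """Guess the characters for a glyph name, or None if no guess is possible."""
    base = name.split(".", 1)[0]   # drop a variant suffix such as ".sc"
    pieces = []
    for part in base.split("_"):   # "f_i" names a ligature of f and i
        chars = _component_to_chars(part)
        if chars is None:
            return None
        pieces.append(chars)
    return "".join(pieces)
```

    A name such as "g123" gives no usable hint and yields None, which is
    one reason fonts with arbitrary glyph names extract poorly without
    explicit character data.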

    Eric.


