Re: extracting code values from PDF?

From: Olaf Drümmer
Date: Thu Oct 26 2006 - 02:28:21 CST


    Hi, wrote Thu, 26 Oct 2006 03:11:35 +0200

    >I have a PDF document including some non-roman characters I would
    >like to obtain the code element value. Is there a tool able to do that?
    >Thank you for the tip.

    Programmatically this is not an easy task in itself, and as far as I
    know there is no good freeware/open-source implementation readily
    available.

    PDFlib does offer a text extraction tool kit (TET) - see
    for more info that I think would do the trick.

    Acrobat's API also offers access to text on a page (the SDK is publicly
    available).

    If you only need to do it occasionally, as a user, you could just use
    Acrobat (recent versions work better than older ones), select the text,
    copy it to the pasteboard and then paste it into some other app that can
    give you the Unicode values.
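
    If no such app is at hand, a few lines of script can play that role.
    The following is only a minimal sketch in Python, using nothing but the
    standard library; the function name code_points and the sample string
    are made up for illustration, and it assumes the copied text arrives
    as an ordinary Unicode string.

```python
import unicodedata


def code_points(text):
    """Return (character, 'U+XXXX', Unicode name) for each character in text."""
    rows = []
    for ch in text:
        # Some code points (e.g. private use, controls) have no name.
        name = unicodedata.name(ch, "<no name>")
        rows.append((ch, f"U+{ord(ch):04X}", name))
    return rows


if __name__ == "__main__":
    sample = "Drümmer"  # hypothetical sample; replace with the pasted text
    for ch, cp, name in code_points(sample):
        print(cp, name)
```

    This only reports what the copy/paste step delivered, so it is no more
    reliable than the text extraction behind it.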

    Also: Acrobat 8 Professional will probably be released by Adobe next
    month. It contains a component called "Preflight", which in turn has
    two features that may be of interest here:
    - a browser for the internal structure of embedded fonts (will give you
    Unicode values for glyphs in the fonts, given they are defined/can be
    - an inventory feature that (among other things) creates tables of
    glyphs for the fonts used in the PDF, together with character IDs and
    Unicode code points (and Unicode glyph names).

    For all suggestions please keep in mind that in some cases
    - it may not be possible to establish the Unicode value
    - the Unicode value may be incorrect because the information in the PDF/
    font is incorrect
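
    Some of these cases can be spotted in extracted text, though by no
    means all of them. The sketch below (Python, standard library only)
    flags code points that often indicate a missing mapping: the
    replacement character U+FFFD, private use code points, and unassigned
    code points. The function name suspicious is hypothetical, and treating
    these three groups as warning signs is my assumption, not a rule from
    the PDF specification. A value that is wrong but still a valid,
    assigned character cannot be detected this way.

```python
import unicodedata


def suspicious(text):
    """Return ('U+XXXX', reason) for code points that hint at a failed mapping."""
    flagged = []
    for ch in text:
        category = unicodedata.category(ch)
        if ch == "\ufffd":
            reason = "replacement character"
        elif category == "Co":
            reason = "private use"
        elif category == "Cn":
            reason = "unassigned"
        else:
            continue  # looks like an ordinary, assigned character
        flagged.append((f"U+{ord(ch):04X}", reason))
    return flagged
```

    An empty result only means nothing obviously odd was found; it does not
    prove that the Unicode values are correct.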

    Olaf Druemmer
    callas software (which happens to be the company having developed
    Preflight for Acrobat ;-> )
