Re: extracting code values from PDF?

From: vunzndi@vfemail.net
Date: Thu Oct 26 2006 - 03:45:46 CST

    I agree with Olaf: the way to do it is to convert the PDF into text. Whether
    that is possible depends on how the PDF was made. Sometimes the content is
    saved in the PDF as images rather than as characters, in which case the only
    way is OCR.

    There are many ways to get from the text to the code values.
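
    As one illustration, once the text has been extracted from the PDF, a few
    lines of Python can list the code value of each character. This is only a
    minimal sketch: the function name code_points and the sample string are
    made up for the example, and it assumes the extracted text is already
    available as a Unicode string.

```python
import unicodedata


def code_points(text):
    """Return (character, 'U+XXXX', Unicode name) for each non-whitespace character.

    Hypothetical helper for illustration; not part of any PDF tool.
    """
    rows = []
    for ch in text:
        if ch.isspace():
            continue  # spaces and line breaks are rarely of interest here
        # unicodedata.name() raises ValueError for unnamed characters
        # unless a default is supplied, so pass a placeholder.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((ch, f"U+{ord(ch):04X}", name))
    return rows


if __name__ == "__main__":
    sample = "Aλ中"  # assumed sample text pasted or extracted from a PDF
    for ch, cp, name in code_points(sample):
        print(ch, cp, name)
```

    The same caveat applies as for any extraction route: the values printed
    are only as reliable as the text that came out of the PDF.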

    How many pages are you dealing with?

    John Knightley

    Quoting Olaf Drümmer <o.druemmer@callassoftware.com>:

    > Hi,
    >
    > jefsey@jefsey.com wrote Thu, 26 Oct 2006 03:11:35 +0200
    >
    > >I have a PDF document including some non-roman characters I would
    > >like to obtain the code element value. Is there a tool able to do that?
    > >Thank you for the tip.
    > >jfc
    >
    >
    > Programmatically this in itself is not an easy task, and as far as I
    > know there is no good freeware/opensource implementation/code readily
    > available.
    >
    > PDFlib offers a text extraction toolkit (TET) that I think would do the
    > trick; see www.pdflib.com for more info.
    >
    > Acrobat's API also offers access to text on a page (the SDK is publicly
    > available).
    >
    >
    > If you just need to do it occasionally/as a user you could just use
    > Acrobat (recent versions work better than older ones), select the text,
    > copy it to the pasteboard and then paste it into some other app that can
    > give you the Unicode values.
    >
    >
    > Also: probably next month Acrobat 8 Professional will be released by
    > Adobe. It contains a component called "Preflight" which in turn does
    > have two features that may be of interest here:
    > - a browser for the internal structure of embedded fonts (will give you
    > Unicode values for glyphs in the fonts, given they are defined/can be
    > established)
    > - an inventory feature that (among other things) creates tables of
    > glyphs for the fonts used in the PDF, together with character IDs and
    > Unicode code points (and Unicode glyph names).
    >
    >
    > For all suggestions please keep in mind that in some cases
    > - it may not be possible to establish the Unicode value
    > - the Unicode value may be incorrect because the information in the PDF/
    > font is incorrect
    >
    >
    > Olaf Druemmer
    > callas software (which happens to be the company that developed
    > Preflight for Acrobat ;-> )
    >
    >
    >
    >
    >


    This archive was generated by hypermail 2.1.5 : Thu Oct 26 2006 - 03:49:25 CST