Re: extracting code values from PDF?

From: vunzndi@vfemail.net
Date: Thu Oct 26 2006 - 03:45:46 CST

Next message: Jukka K. Korpela: "Re: extracting code values from PDF?"

Previous message: Olaf Drümmer: "Re: extracting code values from PDF?"
In reply to: Olaf Drümmer: "Re: extracting code values from PDF?"
Next in thread: Jukka K. Korpela: "Re: extracting code values from PDF?"

I agree with Olaf: the way to do it is to convert the PDF to text. Whether
that is possible depends on how the PDF was made. Sometimes the content is
stored as images rather than as characters, and then the only way is OCR.

There are many ways to get from the text to the code values.
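
As one possibility, assuming the text has already been extracted from the
PDF (copied out of a viewer or saved as Unicode text), a short Python sketch
can list the code point of each character. The function name code_points and
the sample string are only illustrative, not part of any tool mentioned in
this thread.

```python
import unicodedata


def code_points(text):
    """Return (character, 'U+XXXX', name) for each non-whitespace character.

    Characters without a Unicode name (for example private-use code points,
    which PDF extraction sometimes yields when no real mapping exists) are
    reported with the placeholder name '<no name>'.
    """
    rows = []
    for ch in text:
        if ch.isspace():
            continue
        name = unicodedata.name(ch, "<no name>")
        rows.append((ch, f"U+{ord(ch):04X}", name))
    return rows


if __name__ == "__main__":
    # Hypothetical sample; replace with the text extracted from the PDF.
    sample = "Drümmer 引用"
    for ch, cp, name in code_points(sample):
        print(cp, name)
```

A "<no name>" entry is a hint that the extracted value may not be a real
character assignment, so it is worth checking those by eye against the PDF.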

How many pages are you dealing with?

John Knightley

Quoting Olaf Drümmer <o.druemmer@callassoftware.com>:

> Hi,
>
> jefsey@jefsey.com wrote Thu, 26 Oct 2006 03:11:35 +0200
>
> >I have a PDF document including some non-roman characters for which I
> >would like to obtain the code element value. Is there a tool able to do that?
> >Thank you for the tip.
> >jfc
>
>
> Programmatically this in itself is not an easy task, and as far as I
> know there is no good freeware/opensource implementation/code readily
> available.
>
> PDFlib does offer a text extraction toolkit (TET) that I think would do
> the trick - see www.pdflib.com for more info.
>
> Acrobat's API also offers access to text on a page (the SDK is publicly
> available).
>
>
> If you just need to do it occasionally/as a user you could just use
> Acrobat (recent versions work better than older ones), select the text,
> copy it to the pasteboard and then paste it into some other app that can
> give you the unicode values.
>
>
> Also: probably next month Acrobat 8 Professional will be released by
> Adobe. It contains a component called "Preflight" which in turn does
> have two features that may be of interest here:
> - a browser for the internal structure of embedded fonts (will give you
> Unicode values for glyphs in the fonts, given they are defined/can be
> established)
> - an inventory feature that (among other things) creates tables of
> glyphs for the fonts used in the PDF, together with character IDs and
> Unicode code points (and Unicode glyph names).
>
>
> For all suggestions please keep in mind that in some cases
> - it may not be possible to establish the Unicode value
> - the Unicode value may be incorrect because the information in the PDF/
> font is incorrect
>
>
> Olaf Druemmer
> callas software (which happens to be the company having developed
> Preflight for Acrobat ;-> )
>



This archive was generated by hypermail 2.1.5 : Thu Oct 26 2006 - 03:49:25 CST