Re: extracting code values from PDF?

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Thu Oct 26 2006 - 04:12:48 CST


    On Thu, 26 Oct 2006, Olaf Drümmer wrote:

    >> I have a PDF document including some non-roman characters I would
    >> like to obtain the code element value. Is there a tool able to do that?
    > Programmatically this in itself is not an easy task, and as far as I
    > know there is no good freeware/opensource implementation/code readily
    > available.

    A fully automatic method for bulk processing of characters in PDF
    documents is probably difficult indeed. But quite often, it's just a
    matter of knowing about a few characters, or individual characters. (Say,
    you are reading a standard in PDF format and you would very much like to
    know what it means by some dot- or comma-looking spot in some notation or
    transliteration scheme, in Unicode terms. This generally won't give an
    authoritative answer, since there is none, but at least knowing the
    Unicode number of the character tells us what the person who created the
    PDF file may have had in mind.)

    So here's some practical how-to advice for such simple things. (I sent
    this to Jefsey only by accident, when I meant to send to the list; sorry
    for the duplication, Jefsey.)

    On Windows, for example, you can select a piece of text (or even a single
    character) in Adobe Reader using the text select tool, copy it onto the
    clipboard, paste it in MS Word or WordPad, position the cursor after a
    character and press Alt-x. The character is then replaced by its Unicode
    code number. Press Alt-x again to have it turned back to the character.
    (If the character preceding the code number is a hexadecimal digit, you
    need to temporarily insert e.g. a space between them before this restore
    operation; otherwise the digit is taken as part of the number.)

    This is of course clumsy for studying many characters, and there are
    surely better tools for such purposes, but this method requires no
    additional software.
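
    If a Python 3 interpreter happens to be available (an assumption; it
    is not part of the method above), the same lookup can be sketched for
    a whole pasted string at once. The function name describe is made up
    for illustration; unicodedata is part of the standard library. This
    only reports what the copy operation put on the clipboard, so it is
    no more authoritative than the Alt-x approach.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' line for each character in text."""
    lines = []
    for ch in text:
        # Characters without a formal Unicode name (e.g. control
        # characters) get a placeholder instead of raising an error.
        name = unicodedata.name(ch, "<no name>")
        lines.append(f"U+{ord(ch):04X} {name}")
    return lines


# Example: a "d" followed by a dot-below mark, as might be pasted from a
# transliteration table. Replace the string with the text copied from
# the PDF viewer.
for line in describe("d\u0323"):
    print(line)
```

    For the example string this lists U+0064 LATIN SMALL LETTER D and
    U+0323 COMBINING DOT BELOW, which shows at once whether the dot is a
    separate combining character or part of a precomposed letter.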

    PDF protections may prevent copying, though, depending on how the document
    was created.

    -- 
    Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
    

