Re: extracting code values from PDF?

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Thu Oct 26 2006 - 04:12:48 CST

Next message: Otto Stolz: "Re: extracting code values from PDF?"

Previous message: vunzndi@vfemail.net: "Re: extracting code values from PDF?"
In reply to: Olaf Drümmer: "Re: extracting code values from PDF?"
Next in thread: Otto Stolz: "Re: extracting code values from PDF?"
Reply: Otto Stolz: "Re: extracting code values from PDF?"

On Thu, 26 Oct 2006, Olaf Drümmer wrote:

>> I have a PDF document including some non-roman characters I would
>> like to obtain the code element value. Is there a tool able to do that?
- -
> Programmatically this in itself is not an easy task, and as far as I
> know there is no good freeware/opensource implementation/code readily
> available.

A fully automatic method for bulk processing of characters in PDF
documents is probably difficult indeed. But quite often, it's just a
matter of knowing about a few characters, or individual characters. (Say,
you are reading a standard in PDF format and you would very much like to
know what it means by some dot- or comma-looking spot in some notation or
transliteration scheme, in Unicode terms. This generally won't give an
authoritative answer, since there is none, but at least knowing the
Unicode number of the character tells us how the person who created the
PDF file may have thought.)

So here's some practical how-to advice for such simple things. (I sent
this to Jefsey only by accident, when I meant to send to the list; sorry
for the duplication, Jefsey.)

On Windows, for example, you can select a piece of text (or even a single
character) in Adobe Reader using the text select tool, copy it onto the
clipboard, paste it in MS Word or WordPad, position the cursor after a
character and press Alt-x. The character is then replaced by its Unicode
code number. Press Alt-x again to have it turned back to the character.
(If the character just before the code number is itself a hexadecimal
digit, temporarily insert e.g. a space between them before this restore
operation, so that the digit is not taken as part of the number.)

This is of course clumsy for studying many characters, and there are
surely better tools for such purposes, but this method requires no
additional software.
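
If Python happens to be available, the same lookup can be scripted for
longer pieces of text. The following is only a minimal sketch, assuming
the text has already been copied out of the PDF viewer as described above;
the function names describe_chars and from_code are made up for this
illustration, and only the standard unicodedata module is used.

```python
import unicodedata


def describe_chars(text):
    """Return one 'U+XXXX NAME' string per character in text."""
    lines = []
    for ch in text:
        # Characters without a Unicode name (e.g. controls) get a placeholder.
        name = unicodedata.name(ch, "<no name>")
        lines.append(f"U+{ord(ch):04X} {name}")
    return lines


def from_code(hex_digits):
    """Turn a hexadecimal code number such as '0323' back into a character."""
    return chr(int(hex_digits, 16))


# Example: a 'd' followed by a dot-looking combining mark, as might be
# pasted from a transliteration table.
for line in describe_chars("d\u0323"):
    print(line)
# U+0064 LATIN SMALL LETTER D
# U+0323 COMBINING DOT BELOW
```

Like the Alt-x method, this only reports what the copy operation put on
the clipboard, which need not match what the PDF displays.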

PDF protections may prevent copying, though, depending on how the document
was created.

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/


This archive was generated by hypermail 2.1.5 : Thu Oct 26 2006 - 04:14:15 CST