RE: extracting code values from PDF?

From: Rick Cameron (Rick.Cameron@businessobjects.com)
Date: Thu Oct 26 2006 - 12:03:36 CST

Next message: Magda Danish (Unicode): "30th Internationalization & Unicode Conference Announces Keynote Panel"

Previous message: Andreas Prilop: "Re: extracting code values from PDF?"
In reply to: Andreas Prilop: "Re: extracting code values from PDF?"
Next in thread: Eric Muller: "Re: extracting code values from PDF?"
Reply: Eric Muller: "Re: extracting code values from PDF?"

Here's my understanding of how PDF files are structured, and why copying
& pasting from the file cited by Andreas does not work. Anyone who has
better knowledge of the mysteries of PDF, please feel free to correct me.

A PDF file doesn't actually contain characters; it contains fonts and
glyph indices. In order for conversion to characters (Unicode or
otherwise) to be possible, the PDF file must specify the mapping from
glyph indices to Unicode code points (for each font, I think).
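
As a rough illustration of that idea, here is a minimal Python sketch of a
per-font table that turns glyph indices into Unicode code points. The font
name, the table and the glyph indices are all invented for the example; a
real PDF stores this information in its own format, which is not shown here.

```python
# Hypothetical per-font table: glyph index -> Unicode code point.
# The font name and indices are made up; a real font assigns its own.
GLYPH_TO_UNICODE = {
    "ArabicFont": {
        3: 0x0633,  # ARABIC LETTER SEEN
        4: 0x0644,  # ARABIC LETTER LAM
        5: 0x0627,  # ARABIC LETTER ALEF
        6: 0x0645,  # ARABIC LETTER MEEM
    },
}


def glyphs_to_text(font, glyph_indices):
    """Convert glyph indices to text using the table for the given font."""
    table = GLYPH_TO_UNICODE[font]
    return "".join(chr(table[g]) for g in glyph_indices)


text = glyphs_to_text("ArabicFont", [3, 4, 5, 6])
print([f"U+{ord(c):04X}" for c in text])
```

The point of the sketch is only that the glyph indices (3, 4, 5, 6) carry no
meaning by themselves; the text is recoverable only through the table.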

It appears that this PDF file doesn't have this mapping for the font
used for the Arabic characters. It looks to me like copy & paste causes
the raw glyph indices to be interpreted as code points, either in
Unicode or, perhaps, in my default native character set.
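
The two guesses above can be sketched in Python as follows. This is only a
model of the suspected behaviour, not a description of what Adobe Reader
actually does; the function names and sample values are hypothetical, and
cp1252 stands in for "my default native character set".

```python
def as_code_points(glyph_indices):
    """Guess 1: each raw glyph index is treated as a Unicode code point."""
    return "".join(chr(g) for g in glyph_indices)


def as_native(glyph_indices, encoding="cp1252"):
    """Guess 2: each raw glyph index is treated as a byte in a native
    character set. Assumes every index fits in one byte (0..255)."""
    return bytes(glyph_indices).decode(encoding, errors="replace")


# Invented glyph indices that a font might use for two Arabic letters.
raw = [0x48, 0x69]
print(as_code_points(raw))  # Latin letters, not the Arabic text on the page
print(as_native(raw))
```

Either way, the pasted text depends on which numbers the font happens to use
for its glyphs, not on the characters those glyphs represent.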

Cheers

- rick cameron

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
Behalf Of Andreas Prilop
Sent: Thursday, 26 October 2006 7:26
To: unicode@unicode.org
Subject: Re: extracting code values from PDF?

On Thu, 26 Oct 2006, Jukka K. Korpela wrote:

> On Windows, for example, you can select a piece of text (or even a
> single character) in Adobe Reader using the text select tool, copy it
> onto the clipboard, paste it in MS Word or WordPad, position the cursor
> after a character and press Alt-x. The character is then replaced by
> its Unicode code number.

Have you actually tried it?
For example, take some Arabic letters from
http://www.evertype.com/standards/af/af-locales.pdf

This archive was generated by hypermail 2.1.5 : Thu Oct 26 2006 - 12:05:07 CST