RE: extracting code values from PDF?

From: Rick Cameron
Date: Thu Oct 26 2006 - 12:03:36 CST

    Here's my understanding of how PDF files are structured, and why copying
    and pasting from the file cited by Andreas does not work. If anyone has
    better knowledge of the mysteries of PDF, please feel free to correct me.

    A PDF file doesn't actually contain characters; it contains fonts and
    glyph indices. In order for conversion to characters (Unicode or
    otherwise) to be possible, the PDF file must specify the mapping from
    glyph indices to Unicode code points (for each font, I think).
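
To make the idea concrete, here is a minimal Python sketch of such a per-font mapping. The font names, the glyph indices, and the `glyphs_to_text` helper are all made up for illustration. As far as I know, real PDF files carry this information in a per-font ToUnicode CMap, which is more involved than a plain dictionary.

```python
# Hypothetical, hand-made data: the font names and glyph indices are
# invented. A real PDF would store this in a per-font ToUnicode CMap.
TO_UNICODE = {
    "F1": {3: "\u0627", 4: "\u0628", 5: "\u062A"},  # alef, beh, teh
    "F2": {},  # a font embedded without any mapping
}


def glyphs_to_text(font, glyph_ids):
    """Return the text for glyph_ids, or None if any glyph is unmapped."""
    mapping = TO_UNICODE.get(font, {})
    chars = [mapping.get(g) for g in glyph_ids]
    if any(c is None for c in chars):
        # Without a mapping there is no reliable way back to characters.
        return None
    return "".join(chars)
```

With this table, `glyphs_to_text("F1", [3, 4, 5])` yields the three Arabic letters, while the same call for `"F2"` returns `None`, which is the case where copy and paste has nothing reliable to go on.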

    It appears that this PDF file doesn't have this mapping for the font
    used for the Arabic characters. It looks to me like copy & paste causes
    the raw glyph indices to be interpreted as code points, either in
    Unicode or, perhaps, in my default native character set.
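
A rough Python illustration of that guess: the glyph indices below are invented, and Windows-1252 stands in for "my default native character set" purely as an assumption. The point is only that neither reading of the raw numbers produces Arabic letters.

```python
# Invented glyph indices standing in for three Arabic glyphs in some font.
GLYPH_IDS = [0x80, 0xC7, 0xE1]


def as_unicode(glyph_ids):
    """Reading 1: treat each glyph index directly as a Unicode code point."""
    return "".join(chr(g) for g in glyph_ids)


def as_native(glyph_ids, charset="cp1252"):
    """Reading 2: treat each index as one byte in a native character set.

    cp1252 is an assumed example; another system default would give
    different (but equally wrong) characters.
    """
    return bytes(glyph_ids).decode(charset)


def code_points(text):
    """Format each character as U+XXXX for easy comparison."""
    return [f"U+{ord(c):04X}" for c in text]


if __name__ == "__main__":
    print(code_points(as_unicode(GLYPH_IDS)))
    print(code_points(as_native(GLYPH_IDS)))
```

The two readings agree on the last two indices and differ on the first, and in both cases the pasted result is Latin-range junk rather than anything in the Arabic block (U+0600 to U+06FF).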


    - rick cameron

    -----Original Message-----
    From: [] On Behalf Of Andreas Prilop
    Sent: Thursday, 26 October 2006 7:26
    Subject: Re: extracting code values from PDF?

    On Thu, 26 Oct 2006, Jukka K. Korpela wrote:

    > On Windows, for example, you can select a piece of text (or even a
    > single character) in Adobe Reader using the text select tool, copy it
    > onto the clipboard, paste it in MS Word or WordPad, position the
    > cursor after a character and press Alt-x. The character is then
    > replaced by its Unicode code number.

    Have you actually tried it?
    For example, take some Arabic letters from
