Re: minimizing size (was Re: allocation of Georgian letters)

From: Eric Muller ([email protected])
Date: Sat Feb 09 2008 - 19:46:17 CST

Next message: James Kass: "Publishing complex scripts (was Re: minimizing ...)"

Previous message: Eric Muller: "Re: minimizing size (was Re: allocation of Georgian letters)"
In reply to: James Kass: "Re: minimizing size (was Re: allocation of Georgian letters)"
Next in thread: James Kass: "Publishing complex scripts (was Re: minimizing ...)"
Reply: James Kass: "Publishing complex scripts (was Re: minimizing ...)"

James Kass wrote:
>
>
> PDF has long been touted as *the* way to safely send text with the
> assurance that the recipients will be able to display that text exactly
> as the author intended.

Actually, it is "final form documents", not text.
>
> Without any real knowledge of the PDF format and what happens when
> converting a file to PDF, it appears to me that it is not text which is
> being embedded. Rather, the process is embedding glyphs.

Glyphs are the primary construct needed for "final form documents".
Glyphs are mandatory in PDFs.

When you see something like "(the car) Tj" in a PDF content stream, the
"the car" piece only accidentally looks like text (an intended accident,
of course, but an accident nevertheless). What the string actually holds
is character codes that select glyphs in the current font.
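To make the "accident" concrete, here is a minimal Python sketch. It is
not how any particular PDF library works; the function show_string and
the two encoding tables are hypothetical, made up for illustration. It
assumes a simple font whose encoding maps each one-byte code to a glyph
name. The same glyphs are painted whether or not the codes happen to
coincide with ASCII.

```python
from typing import Dict, List

def show_string(codes: bytes, encoding: Dict[int, str]) -> List[str]:
    """Return the glyph names a Tj-like operator would paint.

    Each byte is a character code looked up in the font's encoding;
    unknown codes fall back to the .notdef glyph.
    """
    return [encoding.get(code, ".notdef") for code in codes]

# Hypothetical encoding whose codes happen to coincide with ASCII.
ascii_like = {
    ord("t"): "t", ord("h"): "h", ord("e"): "e", ord(" "): "space",
    ord("c"): "c", ord("a"): "a", ord("r"): "r",
}

# Hypothetical subset font where the generator renumbered the glyphs.
subset = {1: "t", 2: "h", 3: "e", 4: "space", 5: "c", 6: "a", 7: "r"}

readable = show_string(b"the car", ascii_like)
opaque = show_string(b"\x01\x02\x03\x04\x05\x06\x07", subset)
```

Both calls yield the same glyph sequence, yet only the first content
string looks like text when you read the bytes directly.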
> If a glyph
> is mapped to a Unicode value, at least some applications can return that
> value. But, if the glyph is not mapped to a unicode value (which is
> normally the case with presentation forms used in complex scripts),
> there does not seem to be any effort made to preserve the Unicode
> string which generated the presentation form. And that's really a
> shame.

Actually, there are ways to include characters in addition to the
glyphs, even when the character/glyph correspondence is not one-for-one
(look for /ActualText in the PDF reference; /ToUnicode maps are
conceptually optimizations of that), but whether those ways are
exploited depends on the PDF generator. Some generators use nothing,
others will generate only /ToUnicode (what you describe), which can
account for only 1-to-1 character/glyph mappings, and others will use
the full apparatus.
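As a rough illustration of the difference, here is a small Python sketch
of a text extractor. It is a toy model, not the PDF data structures
themselves: the Span class, the extract function, and the glyph codes 1
and 2 are all hypothetical. It assumes a per-glyph map in the spirit of
/ToUnicode and an optional replacement string on a span in the spirit of
/ActualText. The example is Devanagari "ki", where the vowel sign is
painted to the left of the consonant, so the glyph order differs from
the character order.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Span:
    """A run of glyph codes, optionally tagged with replacement text."""
    glyph_codes: List[int]
    actual_text: Optional[str] = None  # stands in for /ActualText

def extract(spans: List[Span], to_unicode: Dict[int, str]) -> str:
    """Recover text: prefer the span-level text, else map glyph by glyph."""
    out: List[str] = []
    for span in spans:
        if span.actual_text is not None:
            out.append(span.actual_text)
        else:
            # Per-glyph lookup; U+FFFD marks glyphs with no mapping.
            out.extend(to_unicode.get(c, "\ufffd") for c in span.glyph_codes)
    return "".join(out)

# Hypothetical glyph codes: 1 = vowel sign I, 2 = letter KA.
to_unicode = {1: "\u093f", 2: "\u0915"}

# Glyphs appear in visual order: vowel sign first, then the consonant.
untagged = [Span([1, 2])]
tagged = [Span([1, 2], actual_text="\u0915\u093f")]

visual_order = extract(untagged, to_unicode)   # U+093F U+0915
logical_order = extract(tagged, to_unicode)    # U+0915 U+093F
```

With only the per-glyph map, the extracted characters come out in visual
order; the span-level text restores the order the author typed.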

For example, if you take the PDFs generated for the UDHR in Unicode
project (e.g.
http://www.unicode.org/udhr/assemblies/first_article_subset.pdf for a
small but comprehensive example), then except for the space problem
mentioned earlier, I think that you can copy from Acrobat, paste into
Notepad, and get back all the text.

Eric.

This archive was generated by hypermail 2.1.5 : Sat Feb 09 2008 - 19:48:14 CST