Re: Joined "ti" coded as "Ɵ" in PDF

From: Don Osborn <dzo_at_bisharat.net>
Date: Thu, 17 Mar 2016 13:45:34 -0400

Thanks Leonardo, that is my initial observation. And it has implications
for web searches.

And there's more. Apparently this is one of a number of such
substitutions, which taken together begin to look like the old
pre-Unicode hacks of 8-bit fonts. And I found some of them via web
search in a number of Google Books and pages on issuu.com. Evidently
some kind of font issue, and not random assignments. From the same document:

ff ligature = ī
fl ligature = Ň
ft ligature = Ō
tt ligature = Ʃ

And perhaps others. Seems to defeat the intent of Unicode, as these
documents and pages will not come up in a typical web search for the normal
spellings (unless maybe Google is incorporating an algorithm to include
results for, say, "internaƟonal" in a search on the term "international"?).
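
For anyone who wants to work around this when indexing or searching text
extracted from such PDFs, the mapping above could be undone with something
like the following Python sketch. The table holds only the five
substitutions seen in this one document (including the "ti" case from the
subject line); other fonts may well use different assignments. The names
LIGATURE_FIXES and repair_ligatures are made up for illustration.

```python
# Substitutions observed in one PDF (see the list above). This is an
# assumption about that document's font, not a general standard.
LIGATURE_FIXES = {
    "\u019F": "ti",  # Ɵ
    "\u012B": "ff",  # ī
    "\u0147": "fl",  # Ň
    "\u014C": "ft",  # Ō
    "\u01A9": "tt",  # Ʃ
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(LIGATURE_FIXES)


def repair_ligatures(text: str) -> str:
    """Replace the stand-in characters with the letter pairs they display as."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(repair_ligatures("interna\u019Fonal"))
```

Note that this would also mangle text that legitimately uses these
characters (ī in Latvian, for example), so it only makes sense for text
known to be English and known to come from an affected PDF.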

Don

On 3/17/2016 1:37 PM, Leonardo Boiko wrote:
> The PDF *displays* correctly. But try copying the string 'ti' from
> the text into another application outside of your PDF viewer, and you'll
> see that the thing that *displays* as 'ti' is *coded* as Ɵ, as Don
> Osborn said.
>
>
> 2016-03-17 14:26 GMT-03:00 Pierpaolo Bernardi <olopierpa_at_gmail.com>:
>> That document displays correctly for me using both the pdf viewer
>> built into chrome and the standalone Acrobat reader v.11. The problem
>> could be in your PDF viewer? What are you viewing the document with?
>>
>> On Thu, Mar 17, 2016 at 5:43 PM, Don Osborn <dzo_at_bisharat.net> wrote:
>>> Odd result when copy/pasting text from a PDF: For some reason "ti" in the
>>> (English) text of the document at
>>> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf
>>> is coded as "Ɵ". Looking more closely at the original text, it does appear
>>> that the glyph is a "ti" ligature (which afaik is not coded as such in
>>> Unicode).
>>>
>>> Out of curiosity, did a web search on "internaƟonal" and got over 11k hits,
>>> apparently all PDFs.
>>>
>>> Anyone have any idea what's going on? Am assuming this is not a deliberate
>>> choice by diverse people creating PDFs and wanting "ti" ligatures for
>>> stylistic reasons. Note the document linked above is current, so this is not
>>> (just) an issue with older documents.
>>>
>>> Don Osborn
Received on Thu Mar 17 2016 - 12:46:31 CDT
