From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Wed Oct 04 2006 - 12:00:42 CST
On Wed, 4 Oct 2006, Paul Johnston wrote:
> I am using Unicode throughout my system (a web-based database for tracking 
> work). I am forced to use a tool (htmldoc - for html to PDF conversion) that 
> does not support unicode in any manner.
Are you sure? I don't mean the limitations of the tool but the necessity 
of using that particular tool. I have successfully converted an HTML 
document with over 1,000 different Unicode characters into PDF, using 
free software available for a normal PC. But maybe you have some policy 
restrictions. (I used PDFCreator. I've heard positive comments about 
CutePDF Writer, and it appears to be cleaner and faster.)
> This should not be a significant 
> problem in practice, as all the data is in English. However, I am having 
> problems with a few characters, primarily an apostrophe-like character (don't 
> know the code offhand; it's not in Latin-1).
It might be _the_ apostrophe used in correctly spelled English, the 
curly apostrophe, called RIGHT SINGLE QUOTATION MARK (U+2019) in Unicode 
(and distinct from the ASCII apostrophe, called APOSTROPHE in Unicode). 
That character belongs to Windows-1252, also known as Windows Latin 1, 
but not to ISO Latin 1.
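
The difference between the two encodings can be checked directly. The 
following is a minimal sketch in Python, assuming the character in 
question really is U+2019; the variable names are only illustrative.

```python
# Assumption: the troublesome character is U+2019, the curly apostrophe.
curly = "\u2019"

# Windows-1252 assigns it a single byte.
cp1252_bytes = curly.encode("cp1252")

# ISO 8859-1 (ISO Latin 1) has no slot for it, so a strict encode fails.
try:
    curly.encode("latin-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False

print(cp1252_bytes, in_latin1)
```

If the character turns out to be something else, the same two encode 
calls will show whether either encoding can carry it.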
> If I encode the output as Windows-1251, the character causes an error.
I'm not sure I understand the situation at all. I don't think you can mean 
Windows-1251, which is Windows Cyrillic, with Cyrillic (Russian) letters 
in the "upper half". I guess this was an "off by one" case and you meant 
Windows-1252. Then the question is what is going on, if your tool can 
produce Windows-1252 output, as one might expect. Is there some problem 
with the _source_? In HTML, you can represent a curly apostrophe in 
several ways; maybe the tool cannot handle all of them.
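
To illustrate the "several ways": the sketch below lists some common 
HTML spellings of the curly apostrophe and decodes them with Python's 
standard html module. The list is an assumption about what might appear 
in the source, not a statement about what htmldoc accepts.

```python
import html

# Common ways the same character can be written in HTML source.
forms = [
    "&rsquo;",    # named entity
    "&#8217;",    # decimal character reference
    "&#x2019;",   # hexadecimal character reference
    "&#146;",     # Windows-1252 byte value used as a reference (invalid
                  # in strict terms, but HTML5 parsing maps it to U+2019)
    "\u2019",     # the character itself, in whatever encoding the page uses
]

decoded = [html.unescape(form) for form in forms]
all_same = all(d == "\u2019" for d in decoded)
print(all_same)
```

A converter that understands only some of these forms would fail on the 
others, so it may be worth checking which form the source actually uses.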
> If I used utf-8 it causes visual garbage in the output.
I'm afraid I cannot visualize the problem. How can you use utf-8 if the 
tool does not support Unicode at all? We might need a more detailed 
description of the process.
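
One possible explanation, offered only as a guess, is that the tool reads 
the UTF-8 bytes as if they were Windows-1252. The sketch below shows what 
that misreading would produce for a word containing U+2019; nothing in 
the description confirms that this is what actually happens.

```python
# Assumption: the source text contains U+2019 and is saved as UTF-8.
text = "don\u2019t"
utf8_bytes = text.encode("utf-8")

# A tool that treats every byte as one Windows-1252 character would
# turn the three-byte UTF-8 sequence into three separate characters.
misread = utf8_bytes.decode("cp1252")

print(utf8_bytes)
print(misread)
```

If the garbage in the PDF looks like the second printed line, the tool is 
most likely doing a byte-by-byte interpretation of this kind.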
> What would be ideal is to 
> perform a "visually approximate" conversion to Windows-1251, which would 
> replace this with a regular apostrophe.
If this is really about the curly apostrophe and about a system that cannot 
deal with it, then the usual way is to replace it by the ASCII apostrophe.
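
Such a replacement can be done before the text reaches the tool. The 
following is a rough sketch, not a complete "visually approximate" 
conversion; the mapping table and the function name are hypothetical and 
cover only the curly quotation marks.

```python
# Hypothetical mapping; extend it with whatever characters turn up in the data.
APPROXIMATIONS = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (curly apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def to_ascii_punctuation(text: str) -> str:
    """Replace curly quotation marks with their ASCII look-alikes."""
    return text.translate(APPROXIMATIONS)

sample = "the client\u2019s \u201curgent\u201d request"
result = to_ascii_punctuation(sample)

# After the replacement the sample encodes as ISO Latin 1 without error.
latin1_bytes = result.encode("latin-1")
print(result)
```

An explicit table keeps the substitutions visible and avoids silently 
changing characters nobody intended to touch.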
> I know Windows can do this, as retrieving values from controls using a 
> non-Unicode interface does exactly this conversion.
I don't see what you mean by that, but I have seen Windows software map 
some Windows Latin 1 characters to ISO Latin 1 characters. That's what 
e.g. Outlook Express (silently!) does if the default encoding for outgoing 
messages has been set to iso-8859-1 but the data contains e.g. a curly 
apostrophe.
My analysis (well, guess) might, as usual, be all wrong. Perhaps the 
apostrophe-like character is really e.g. MODIFIER LETTER RIGHT HALF RING. 
It's not used in normal English, but it is used in scientific 
transliteration of Arabic words and might therefore conceivably appear 
within English text.
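
Rather than guessing, the character can be identified from the data 
itself. This is a small sketch using Python's unicodedata module; the 
helper name describe is hypothetical, and the sample strings stand in for 
text taken from the database.

```python
import unicodedata

def describe(text):
    """List the code point and Unicode name of each non-ASCII character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 0x7F
    ]

# Two of the candidates discussed above, embedded in sample words.
for sample in ("don\u2019t", "Qur\u02bean"):
    print(describe(sample))
```

Running this on an offending field value would settle which character is 
involved and therefore which replacement makes sense.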
-- Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/