Re: "Visually approximate" conversion from unicode to Windows-1251 (or similar code page)

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Wed Oct 04 2006 - 12:00:42 CST


    On Wed, 4 Oct 2006, Paul Johnston wrote:

    > I am using Unicode throughout my system (a web-based database for tracking
    > work). I am forced to use a tool (htmldoc - for html to PDF conversion) that
    > does not support unicode in any manner.

    Are you sure? I don't mean the limitations of the tool but the necessity
    of using that particular tool. I have successfully converted an HTML
    document with over 1,000 different Unicode characters into PDF, using
    free software available for a normal PC. But maybe you have some policy
    restrictions. (I used PDFCreator. I've heard positive comments about
    CutePDF Writer, and it appears to be cleaner and faster.)

    > This should not be a significant
    > problem in practice, as all the data is in English. However, I am having
    > problems with a few characters, primarily an apostrophe-like character (don't
    > know the code offhand; it's not in Latin-1).

    It might be _the_ apostrophe used in correctly spelled English, the
    curly apostrophe, called RIGHT SINGLE QUOTATION MARK (U+2019) in Unicode
    (and distinct from the ASCII apostrophe, called APOSTROPHE, U+0027). That
    character belongs to Windows-1252, also known as Windows Latin 1, but not
    to ISO Latin 1.
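
    As a quick check of that claim, here is a minimal Python sketch,
    assuming the character in question is U+2019. The helper name fits()
    is made up for illustration; "cp1252" and "latin-1" are Python's codec
    names for Windows-1252 and ISO 8859-1.

```python
import unicodedata


def fits(ch: str, encoding: str) -> bool:
    """Return True if ch can be encoded in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


curly = "\u2019"            # the curly apostrophe
ascii_apostrophe = "\u0027"  # the ASCII apostrophe

print(unicodedata.name(curly))
print(unicodedata.name(ascii_apostrophe))
print(fits(curly, "cp1252"))   # Windows Latin 1
print(fits(curly, "latin-1"))  # ISO Latin 1
```

    In Windows-1252 the character occupies the single byte 0x92, which ISO
    Latin 1 reserves for a control character.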

    > If I encode the output as Windows-1251, the character causes an error.

    I'm not sure I understand the situation at all. I don't think you can mean
    Windows-1251, which is Windows Cyrillic, with Cyrillic (Russian) letters
    in the "upper half". I guess this was an "off by one" case and you meant
    Windows-1252. Then the question is what is going on, if your tool can
    produce Windows-1252 output, as one might expect. Is there some problem
    with the _source_? In HTML, you can represent a curly apostrophe in
    several ways; maybe the tool cannot handle all of them.
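
    To illustrate the "several ways", here is a small Python sketch,
    assuming the character is U+2019. It lists the notations I would expect
    to meet in HTML source and uses the standard html.unescape() to show
    that they all denote the same character. A tool might handle some of
    these and not others.

```python
import html

# Four ways an HTML source might represent the curly apostrophe.
forms = [
    "&rsquo;",   # named character reference
    "&#8217;",   # decimal numeric reference
    "&#x2019;",  # hexadecimal numeric reference
    "\u2019",    # the character itself, in whatever encoding the page uses
]

decoded = [html.unescape(form) for form in forms]
print(all(ch == "\u2019" for ch in decoded))
```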

    > If I used utf-8 it causes visual garbage in the output.

    I'm afraid I cannot visualize the problem. How can you use UTF-8 if the
    tool does not support Unicode at all? We might need a more detailed
    description of the process.
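
    One guess at what the "visual garbage" might be: UTF-8 bytes being
    read as if they were Windows-1252. This is only an assumption about
    the process, but it is easy to reproduce in Python:

```python
# Encode a curly apostrophe as UTF-8, then misread the bytes as Windows-1252.
utf8_bytes = "It\u2019s".encode("utf-8")
misread = utf8_bytes.decode("cp1252")

print(utf8_bytes)  # the three bytes E2 80 99 stand for U+2019
print(misread)     # each of those bytes shows up as a separate character
```

    If the output looks like this (a-circumflex, euro sign, trademark sign
    in place of the apostrophe), the tool is probably treating the input as
    an 8-bit encoding.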

    > What would be ideal is to
    > perform a "visually approximate" conversion to Windows-1251, which would
    > replace this with a regular apostrophe.

    If this is really about the curly apostrophe and about a system that
    cannot deal with it, then the usual way is to replace it by the ASCII
    apostrophe.
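
    A minimal sketch of such a replacement in Python follows. The table
    APPROXIMATIONS and the function approximate_encode() are hypothetical
    names, and the choice of look-alikes is my assumption, not a standard
    mapping; extend it to whatever characters actually occur in the data.

```python
# Hand-picked look-alike replacements (an assumption, not a standard table).
APPROXIMATIONS = {
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK (curly apostrophe)
    "\u02be": "'",  # MODIFIER LETTER RIGHT HALF RING
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
    "\u2013": "-",  # EN DASH
}
TABLE = str.maketrans(APPROXIMATIONS)


def approximate_encode(text: str, encoding: str = "latin-1") -> bytes:
    """Replace known look-alikes, then encode; anything else unencodable becomes '?'."""
    return text.translate(TABLE).encode(encoding, errors="replace")


print(approximate_encode("It\u2019s a caf\u00e9"))
```

    Characters that the target encoding already supports pass through
    unchanged, so only the listed characters are degraded.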

    > I know Windows can do this, as retrieving values from controls using a
    > non-Unicode interface does exactly this conversion.

    I don't see what you mean by that, but I have seen Windows software map
    some Windows Latin 1 characters to ISO Latin 1 characters. That's what
    e.g. Outlook Express (silently!) does if the default encoding for outgoing
    messages has been set to iso-8859-1 but the data contains e.g. a curly
    apostrophe.

    My analysis (well, guess) might, as usual, be all wrong. Perhaps the
    apostrophe-like character is really e.g. MODIFIER LETTER RIGHT HALF RING.
    It's not used in normal English, but it is used in scientific
    transliteration of Arabic words and might therefore conceivably appear
    within English text.
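
    Rather than guessing, one can simply ask what the character is. A small
    Python sketch, with describe() as a made-up helper name and the sample
    string invented for the purpose:

```python
import unicodedata


def describe(text: str) -> list[str]:
    """List each non-ASCII character in text as 'U+XXXX NAME'."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
        if ord(ch) > 0x7F
    ]


for line in describe("Qur\u02bean, it\u2019s"):
    print(line)
```

    Running this over the offending field of the database should settle
    which apostrophe-like character is involved.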

    -- 
    Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
    

