Re: "Visually approximate" conversion from unicode to Windows-1251 (or similar code page)

From: Addison Phillips (addison@yahoo-inc.com)
Date: Wed Oct 04 2006 - 12:24:31 CST

    I think you mean "windows-1252", the Western European code page. Code
    page 1251 is the Cyrillic code page.

    Windows-1252, like many Microsoft code pages, differs from the related
    "standard" encoding. In this case, it is a superset of ISO 8859-1 (often
    referred to as Latin-1). The difference is that Microsoft added 27
    characters in the C1 control range (0x80->0x9F), including the Euro
    symbol and a variety of "typesetter's quotes". These often cause
    problems for software expecting pure ISO 8859-1.
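
    As a rough illustration of that difference, here is a small Python
    sketch. It assumes Python's built-in codecs, where "cp1252" is the
    name for windows-1252 and "latin-1" is ISO 8859-1. The function name
    is made up for this example.

```python
# Compare how the C1 range (0x80-0x9F) decodes under the two encodings.
# Assumes Python's built-in "cp1252" and "latin-1" codecs.

def c1_assigned():
    """Return {byte: character} for C1 bytes that cp1252 maps to a character."""
    mapped = {}
    for b in range(0x80, 0xA0):
        try:
            mapped[b] = bytes([b]).decode("cp1252")
        except UnicodeDecodeError:
            # Python's codec leaves a few positions in this range unassigned.
            pass
    return mapped

table = c1_assigned()
print(len(table))                              # 27 assigned positions
print(hex(ord(table[0x80])))                   # 0x20ac, the Euro sign
print(hex(ord(table[0x92])))                   # 0x2019, right single quote
print(repr(bytes([0x92]).decode("latin-1")))   # '\x92', a C1 control character
```

    The same byte (0x92) is a printable quote in one encoding and an
    invisible control character in the other, which is why mislabeled
    pages lose their apostrophes.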

    HTMLDOC has both command-line and GUI options that let you select
    the appropriate Windows encoding (sorry, not UTF-8) to use when reading
    the source files. You should also include a correct <meta> tag declaring
    the encoding to be "windows-1252" and *not* "iso-8859-1" in your HTML
    documents.
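
    A minimal sketch of producing such a page, assuming the HTML is
    generated from Python: the point is only that the declared charset
    and the bytes actually written must agree. The function name and the
    sample text are hypothetical, and how you hand the result to HTMLDOC
    is left out.

```python
# Build an HTML page whose declared encoding matches its actual bytes.
# render_page and the sample content are hypothetical names for illustration.

META = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">'

def render_page(title, body_html):
    """Return the page as windows-1252 bytes with a matching <meta> tag."""
    html = (
        f"<html><head>{META}<title>{title}</title></head>"
        f"<body>{body_html}</body></html>"
    )
    # The default error mode is strict: a character with no windows-1252
    # equivalent raises UnicodeEncodeError instead of writing a bad byte.
    return html.encode("cp1252")

page = render_page("Report", "<p>It\u2019s done, \u20ac5</p>")
print(page)
```

    Here the right single quote is written as byte 0x92 and the Euro sign
    as 0x80, both of which are only meaningful if the reader treats the
    file as windows-1252.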

    If that doesn't work, you can also use HTML entities in your pages to
    replace the characters. For example, &rsquo; is a right single quote. Or
    you can use a transliterating converter (such as the //TRANSLIT option
    in libiconv) to approximate the right results. (Caution: you may
    experience data degradation with this last "solution".)
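
    Both fallbacks can be sketched in Python. This is an illustration
    under stated assumptions, not what libiconv does internally: the
    replacement table and both function names are invented for the
    example, and the table is far from complete. The first function uses
    the standard "xmlcharrefreplace" error handler, which emits numeric
    character references rather than named entities such as &rsquo;.

```python
import unicodedata

# Hand-picked look-alikes for common typographic punctuation.
# Illustrative only; a real table would need many more entries.
PUNCT = {
    "\u2018": "'", "\u2019": "'",   # single quotes
    "\u201c": '"', "\u201d": '"',   # double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # horizontal ellipsis
}

def to_entities(text, encoding="cp1252"):
    """Lossless for HTML: unencodable characters become &#NNNN; references."""
    return text.encode(encoding, errors="xmlcharrefreplace")

def approximate(text, encoding="cp1252"):
    """Lossy: substitute look-alikes, strip accents, then fall back to '?'."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)          # already representable, keep as-is
            continue
        except UnicodeEncodeError:
            pass
        if ch in PUNCT:
            out.append(PUNCT[ch])
        else:
            # NFKD splits accented letters into base + combining marks;
            # drop the marks and keep the base characters.
            base = "".join(c for c in unicodedata.normalize("NFKD", ch)
                           if not unicodedata.combining(c))
            out.append(base if base else "?")
    # Anything still unencodable at this point becomes '?'.
    return "".join(out).encode(encoding, errors="replace")

print(to_entities("It\u2019s", "latin-1"))   # b'It&#8217;s'
print(approximate("It\u2019s", "latin-1"))   # b"It's"
```

    Note that with the default "cp1252" target the right single quote
    needs no approximation at all, since windows-1252 can encode it
    directly; the substitution only kicks in for narrower targets such as
    ISO 8859-1.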

    Hope that helps.

    Addison

    -- 
    Addison Phillips
    Globalization Architect -- Yahoo! Inc.
    Internationalization is an architecture.
    It is not a feature.

    Paul Johnston wrote:
    > Hi,
    > 
    > I am using Unicode throughout my system (a web-based database for 
    > tracking work). I am forced to use a tool (htmldoc - for HTML to PDF 
    > conversion) that does not support Unicode in any manner. This should not 
    > be a significant problem in practice, as all the data is in English. 
    > However, I am having problems with a few characters, primarily an 
    > apostrophe-like character (don't know the code offhand; it's not in 
    > Latin-1).
    > 
    > If I encode the output as Windows-1251, the character causes an error. 
    > If I used utf-8 it causes visual garbage in the output. What would be 
    > ideal is to perform a "visually approximate" conversion to Windows-1251, 
    > which would replace this with a regular apostrophe. I am happy to accept 
    > the risks that such an approximation carries.
    > 
    > I know Windows can do this, as retrieving values from controls using a 
    > non-Unicode interface does exactly this conversion. However, I have not 
    > been able to find out how I can perform the conversion at will. I 
    > apologise if this is not the most appropriate forum for this question, 
    > but I have been looking long and hard for this without success.
    > 
    > Many thanks for any help you can offer,
    > 
    > Paul
    > 
    > P.S. If someone can suggest a Unicode-compatible replacement for 
    > htmldoc, that would satisfy me too!