Re: unicode entities, "beginner" questions...

From: Philippe VERDY (
Date: Sun Mar 13 2005 - 09:57:20 CST


    I understand your frustration, but most of these problems come from the need, in application programming interfaces, to keep compatibility with legacy interfaces.
    For example, I manage a set of translations for a Java app, stored in sets of .properties files. Unfortunately, the Java API for handling resource bundles still does not know (even in Java 1.5) how to recognize UTF-8 encoded files (even if we include a leading BOM), so the Java resource bundle loader will only process files using the legacy ISO-8859-1 or US-ASCII character set. Any other Unicode character must be encoded with so-called "Unicode escapes" (of the form "\uXXXX", where XXXX is a hex-encoded UTF-16 code unit).
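
    To make the escape mechanism concrete, here is a minimal sketch of how java.util.Properties decodes such a file. The class name, the helper loadLatin1, and the sample keys are made up for illustration; the sketch also uses StandardCharsets, which is newer than the Java 1.5 discussed here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class EscapedPropertiesDemo {
    // Hypothetical helper: parses .properties text from ISO-8859-1 bytes,
    // the encoding that Properties.load(InputStream) assumes.
    static Properties loadLatin1(String content) {
        Properties props = new Properties();
        byte[] bytes = content.getBytes(StandardCharsets.ISO_8859_1);
        try {
            props.load(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // The doubled backslashes place literal six-character escapes in the
        // file content; the e-acute is a plain ISO-8859-1 character.
        Properties props = loadLatin1(
                "language=\\u65e5\\u672c\\u8a9e\ncafe=caf\u00e9\n");
        // Each escape decodes to one UTF-16 code unit.
        System.out.println(props.getProperty("language").length()); // prints 3
        System.out.println(props.getProperty("cafe").length());     // prints 4
    }
}
```

    The three escapes come back as three Japanese characters, while the accented Latin-1 letter needs no escape at all.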

    This is frustrating, and when managing translations, it is nearly impossible to find translators who have the technical knowledge required to work with this format. For this reason, I give translators templates written as UTF-8 encoded files (with a leading BOM), which are much more user-friendly. I let them work on this version, and use the UTF-8 file as the reference file for all translations.
    The actual .properties files are generated automatically by a home-made validation tool that checks the overall format, checks for duplicate resource keys, reorders keys, makes them properly delimited with no extra spaces, checks punctuation and the presence of variable placeholders, and also creates the actual .properties file in a way similar to the Java JDK tool "native2ascii -encoding UTF-8" (except that my tool converts from UTF-8 to ISO-8859-1, leaving all ISO-8859-1 characters unescaped, including all those that are not US-ASCII). This tool also works with an internal CVS-based history tool, and can generate comments to help translators; these are preserved in the UTF-8 reference source file, but filtered out of the generated final .properties files, where all comments and blank lines are removed.
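
    The following is only a rough sketch of that kind of generator, not the actual tool: it assumes simple "key = value" lines already decoded from UTF-8, and every name in it (PropertiesGenerator, escapeAboveLatin1, generate) is hypothetical. It drops the BOM, comments and blank lines, rejects duplicate keys, sorts the keys, and escapes only what ISO-8859-1 cannot hold. Punctuation and placeholder checks, and escaping of backslashes in values, are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PropertiesGenerator {
    // Escapes every UTF-16 code unit above U+00FF; all ISO-8859-1
    // characters (including non-ASCII ones) pass through unchanged.
    static String escapeAboveLatin1(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0xFF) {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Turns reference lines into final .properties lines, sorted by key.
    static List<String> generate(List<String> referenceLines) {
        Map<String, String> entries = new TreeMap<>();
        boolean first = true;
        for (String raw : referenceLines) {
            String line = raw;
            if (first && line.startsWith("\uFEFF")) {
                line = line.substring(1); // drop the leading BOM
            }
            first = false;
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#") || line.startsWith("!")) {
                continue; // comments and blank lines are filtered out
            }
            int eq = line.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Malformed line: " + raw);
            }
            String key = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim();
            if (entries.putIfAbsent(key, value) != null) {
                throw new IllegalArgumentException("Duplicate key: " + key);
            }
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            output.add(escapeAboveLatin1(e.getKey() + "=" + e.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> reference = Arrays.asList(
                "\uFEFF# note for translators",
                "b = caf\u00e9",
                "",
                "a=\u65e5\u672c");
        for (String line : generate(reference)) {
            System.out.println(line);
        }
    }
}
```

    With the sample input, the key "a" is emitted first with its two Japanese characters escaped, followed by "b" with its accented letter left as a raw ISO-8859-1 character.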

    So I am no longer limited by the Java API, and translators can now work with friendlier UTF-8 files. (In fact, I also accept translations sent in Word documents, or within the body of an email, from which I can just copy/paste the text.) The Java native2ascii tool is very useful to me for making the necessary code conversions (because some translators will send me translations in some legacy Windows 125x codepage, or SJIS, or a Mac 8-bit charset).
    I also use the output of native2ascii to make sure (and correct if needed) that resources in RTL languages are properly formatted. RTL resources notably need to mix Latin with Arabic or Hebrew script, in lines starting with Latin; in such cases the Unicode Bidi algorithm may incorrectly render the UTF-8 file, and translators may have swapped elements. With the hex-encoded version, I can easily check the correct encoding order, and edit it in the ASCII version before reconverting it to UTF-8 to regenerate the corrected reference file.
    Note that this problem with RTL-script languages occurs equally in the XML, .properties, and INI file formats. For now there is no definitive format that works well with RTL languages, unless the Latin property key or element tag is kept on a separate line from the translated value (this is the only way to keep the BiDi algorithm from falling into traps caused by the necessarily mixed scripts of the property key and the translated property value).
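
    The same kind of check can be sketched in a few lines of Java: listing the code points in storage (logical) order shows what is really encoded, whatever a Bidi-aware editor displays. The class and method names below are invented for the example, and the Hebrew word is only sample data.

```java
class LogicalOrderDump {
    // True for strong right-to-left characters (Hebrew, Arabic, ...).
    static boolean isRightToLeft(int codePoint) {
        byte d = Character.getDirectionality(codePoint);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Lists code points in storage (logical) order, which does not depend
    // on how a Bidi renderer would arrange them on screen.
    static String dump(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A Latin key followed by a Hebrew value, written with escapes so
        // that this source file itself is not subject to Bidi reordering.
        String line = "ok=\u05e9\u05dc\u05d5\u05dd";
        System.out.println(dump(line));
        // prints: U+006F U+006B U+003D U+05E9 U+05DC U+05D5 U+05DD
        System.out.println(isRightToLeft(line.codePointAt(3))); // prints true
    }
}
```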
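
    For .properties files specifically, one way to get that layout (an illustration, not necessarily the layout used in the workflow above) is the format's line continuation: a backslash at the end of the key line lets the value start on the next line, and the leading whitespace of the continuation line is discarded. The sketch below assumes Properties.load(Reader), which only exists since Java 6; the class name and sample key are made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class SeparateLineDemo {
    // Hypothetical helper: parses .properties text that is already decoded
    // into characters (for example, read from a UTF-8 reference file).
    static Properties loadText(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // File content, over two physical lines:
        //   line 1: the Latin key, an equals sign, and a trailing backslash
        //   line 2: indentation, then the Hebrew value on its own
        String content = "greeting=\\\n    \u05e9\u05dc\u05d5\u05dd\n";
        Properties props = loadText(content);
        // The value holds only the four Hebrew letters; the indentation of
        // the continuation line is not part of it.
        System.out.println(props.getProperty("greeting").length()); // prints 4
    }
}
```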

    > Message of 13/03/05 08:25
    > From:
    > To:
    > Cc:
    > Subject: unicode entities, "beginner" questions...
    > I apologize for the level of the questions. If this is not the right
    > place, I'd appreciate pointers to lists where I can get information.
    > I am a translator, working in Japanese, English and French and I use
    > tools that work with mostly utf-8 files, namely:
    > - (or rather the OSX version NeoOffice/J) as a file
    > converter and
    > - OmegaT (a Java app) as a translation memory tool.
    > I have had issues with both since I realized that, contrary to
    > Unicode-supporting OSX apps (TextEdit to give a simple example, but also
    > most text editors on OSX), the above apps translate all the Japanese (and
    > French non-ASCII) characters to non-human-readable entities that make
    > direct editing of output files almost impossible.
    > I seem to not understand the reality of what unicode is and I thus am
    > stuck with files and no way to convert them to human readable output.
    > So my questions are:
    > Why do those tools favor a non-human readable output form ? Is there a
    > valid technical reason to do so ?
    > What are the technical differences between human readable unicode
    > output and entity based unicode output ?
    > Are there easy ways to convert from one to the other ?
    > Are there other forms a unicode character can take ?
    > I think I understand that fundamentally a character is just a number to
    > the computer that points to a place in a list for display purposes and
    > that is further modified or "encoded" for transmission purposes.
    > But I don't see where the entities and their necessity fits in this...
    > When I started as a html writer, about 10 years ago, I used to convert
    > my French accented letters to html entities to "make sure" that they'd
    > be displayed properly. But with encoding/character set recognition this
    > is no longer necessary and I can write French, or Japanese, save the text
    > in the proper encoding and document that encoding in the file for
    > interpretation purposes.
    > It seems to me using entities now is going back 10 years or so,
    > especially when one works with applications that _expect_ utf-8
    > files... Entities may be necessary for rare characters, but for all the
    > rest ???
    > Thanks in advance for the answers and /or clarifications & pointers...
    > Sincerely,

    This archive was generated by hypermail 2.1.5 : Sun Mar 13 2005 - 09:58:53 CST