Re: FW: Subj: Converting from UCS-2 to UTF-8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Aug 20 2005 - 05:24:22 CDT


    From: "Gregg Reynolds" <unicode@arabink.com>
    > Philippe Verdy wrote:
    >>
    > ...
    >>
    >> For those users that still run Windows 95/98/ME, this won't work, as
    >> these systems can only do the following:
    >>
    > ...
    >> So there are still applications needing converters based on other
    >> routines and mappings.
    >>
    >
    > Another possibility I just discovered in my ever-expanding list of
    > neglected bookmarks is the Free recode package at
    > http://recode.progiciels-bpi.ca/index.html

    Of course, and it is in fact better than ''iconv'', and portable to
    environments other than *nix/Linux.
    ''iconv'' does not correctly handle stateful conversions or conversions of
    large charsets, but ''recode'' performs them marvelously; that's why the
    latter supports many more charsets, and ''iconv'' is now effectively deprecated.

    Don't forget the other tool from the Java SDK: ''native2ascii''. It can
    perform conversions from/to a lot of charsets, using as the intermediate
    representation a special Java text format based on:

        * the UTF-16 encoding for the internal representation (yes, it can handle
    characters present in various charsets that are mapped outside the BMP, by
    using surrogate pairs);

        * an encoding syntax normally used for Java source files and .properties
    files:

            ** it can internally represent UTF-16 code units that fall outside
    the ISO 8859-1 range using hexadecimal "Unicode escape sequences" of the
    form "\uXXXX", where XXXX is the hexadecimal value of a UTF-16 code unit (a
    small sketch of this escaping rule follows the list below).

            ** it will represent the "\" character itself using the "\\" escape
    sequence.

            ** other escape sequences are also recognized when converting from
    this Java representation to other charsets.

        * the ISO 8859-1 charset (not only ASCII as the tool name suggests) for
    the physical encoding of the syntax.
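
    For illustration, here is a minimal Java sketch of the escaping rule
    described above (this is not the actual native2ascii source, only an
    assumption of how the format works, and the class and method names are
    hypothetical). It escapes every UTF-16 code unit outside the ISO 8859-1
    range as "\uXXXX" and doubles the backslash; as noted further below, the
    tool itself escapes all non-ASCII characters by default:

        public class EscapeSketch {
            // Turn a string into the Java/.properties style representation:
            // code units outside ISO 8859-1 become \uXXXX, "\" becomes "\\".
            static String escape(String s) {
                StringBuilder out = new StringBuilder();
                for (int i = 0; i < s.length(); i++) {
                    char c = s.charAt(i); // one UTF-16 code unit at a time, so a
                                          // supplementary character yields two escapes
                    if (c == '\\') {
                        out.append("\\\\");
                    } else if (c <= 0x00FF) {
                        out.append(c);    // representable in ISO 8859-1, kept as is
                    } else {
                        out.append(String.format("\\u%04x", (int) c));
                    }
                }
                return out.toString();
            }

            public static void main(String[] args) {
                // U+00E9 stays, U+0152 is escaped, U+1D11E (outside the BMP)
                // becomes the surrogate pair \ud834\udd1e
                System.out.println(escape("caf\u00E9 \u0152 \uD834\uDD1E"));
            }
        }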

    This converter can be used reliably with all charsets supported by your
    current Java VM. If you have installed Java with the i18n supplementary
    jars, you'll get support for all ISO 8859 charsets, as well as many national
    Asian and Russian standards, lots of DOS/OEM and Windows codepages, and
    legacy 8-bit MacOS charsets.
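
    If you want to check exactly which charsets your own VM supports before
    relying on ''native2ascii'', a quick way (assuming Java 1.4 or later, where
    java.nio.charset is available; the loop syntax needs Java 5, and the class
    name is just an example) is:

        import java.nio.charset.Charset;

        // Print every charset name the running VM can convert from/to,
        // including those added by the optional i18n/charsets jars.
        public class ListCharsets {
            public static void main(String[] args) {
                for (String name : Charset.availableCharsets().keySet()) {
                    System.out.println(name);
                }
            }
        }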

    A typical usage could be for example:

        native2ascii -encoding ISO8859-2 < input-file.ISO8859-2.txt |
        native2ascii -reverse -encoding UTF-8 > output-file.UTF-8.txt

    where:

        * The internal Java representation is not stored but feeds the pipe for
    the second invocation. Note that when the "-reverse" parameter is not used,
    the output will be ASCII only (meaning that all non-ASCII characters of the
    ISO 8859-1 set are also converted into hexadecimal Unicode escape
    sequences).

        * The "-encoding" parameter (with its required value parameter) is
    optional, but its default value is the current platform's native GUI charset
    (for example windows-1252 on Western European localizations of Windows), so
    it generally must be specified.
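
    To see which default ''native2ascii'' would assume on your machine, you
    can query the VM itself (assuming the tool falls back to the same default
    as the "file.encoding" property / Charset.defaultCharset(); the class name
    is only an example):

        import java.nio.charset.Charset;

        // Print the platform default charset that is used when -encoding
        // is omitted (under the assumption stated above).
        public class DefaultCharset {
            public static void main(String[] args) {
                System.out.println(System.getProperty("file.encoding"));
                System.out.println(Charset.defaultCharset()); // Java 5 and later
            }
        }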

    So to convert from UCS-2 or UTF-16 to UTF-8, one would run:

        native2ascii -encoding UTF-16 < input-file.UTF-16.txt |
        native2ascii -reverse -encoding UTF-8 > output-file.UTF-8.txt
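
    If you would rather do this particular conversion directly in a few lines
    of Java instead of piping two ''native2ascii'' invocations, a minimal
    sketch could be the following (the file names are only placeholders; the
    "UTF-16" decoder honours a BOM and assumes big-endian when there is none):

        import java.io.*;

        // Convert a UTF-16 text file to UTF-8 without going through the
        // escaped intermediate representation.
        public class Utf16ToUtf8 {
            public static void main(String[] args) throws IOException {
                Reader in = new InputStreamReader(
                        new FileInputStream("input-file.UTF-16.txt"), "UTF-16");
                Writer out = new OutputStreamWriter(
                        new FileOutputStream("output-file.UTF-8.txt"), "UTF-8");
                char[] buf = new char[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                out.close();
                in.close();
            }
        }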


