Re: FW: Subj: Converting from UCS-2 to UTF-8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Aug 20 2005 - 05:24:22 CDT


    From: "Gregg Reynolds" <unicode@arabink.com>
    > Philippe Verdy wrote:
    >>
    > ...
    >>
    >> For those users that still run Windows 95/98/ME, this won't work, as
    >> these systems can only do the following:
    >>
    > ...
    >> So there are still applications needing converters based on other
    >> routines and mappings.
    >>
    >
    > Another possibility I just discovered in my ever-expanding list of
    > neglected bookmarks is the Free recode package at
    > http://recode.progiciels-bpi.ca/index.html

    Of course, and it is in fact better than ''iconv'', and portable to
    environments other than *nix/Linux.
    ''iconv'' does not correctly handle stateful conversions or conversions of
    large charsets, but ''recode'' performs them marvelously; that's why the
    latter supports many more charsets, and ''iconv'' is now effectively deprecated.

    Don't forget the other tool from the Java SDK: ''native2ascii''. It can
    perform conversions from/to a lot of charsets, using as the intermediate
    representation a special Java text format based on:

        * the UTF-16 encoding for the internal representation (yes, it can handle
    characters present in various charsets that are mapped outside the BMP, by
    using surrogate pairs);

        * an encoding syntax normally used for Java source files and .properties
    files:

            ** it can internally represent UTF-16 code units that fall outside
    the ISO 8859-1 range using hexadecimal "Unicode escape sequences" of the
    form "\uXXXX", where XXXX is the hexadecimal value of a UTF-16 code unit (a
    small sketch of this escaping rule follows the list below).

            ** it will represent the "\" character itself using the "\\" escape
    sequence.

            ** other escape sequences are also recognized when converting from
    this Java representation to other charsets.

        * the ISO 8859-1 charset (not only ASCII as the tool name suggests) for
    the physical encoding of the syntax.
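
    For illustration, here is a minimal Java sketch of the escaping rule
    described above (this is not the actual native2ascii source, only an
    assumption of how the format works, and the class and method names are
    hypothetical). It escapes every UTF-16 code unit outside the ISO 8859-1
    range as "\uXXXX" and doubles the backslash; as noted further below, the
    tool itself escapes all non-ASCII characters by default:

        public class EscapeSketch {
            // Turn a string into the Java/.properties style representation:
            // code units outside ISO 8859-1 become \uXXXX, "\" becomes "\\".
            static String escape(String s) {
                StringBuilder out = new StringBuilder();
                for (int i = 0; i < s.length(); i++) {
                    char c = s.charAt(i); // one UTF-16 code unit at a time, so a
                                          // supplementary character yields two escapes
                    if (c == '\\') {
                        out.append("\\\\");
                    } else if (c <= 0x00FF) {
                        out.append(c);    // representable in ISO 8859-1, kept as is
                    } else {
                        out.append(String.format("\\u%04x", (int) c));
                    }
                }
                return out.toString();
            }

            public static void main(String[] args) {
                // U+00E9 stays, U+0152 is escaped, U+1D11E (outside the BMP)
                // becomes the surrogate pair \ud834\udd1e
                System.out.println(escape("caf\u00E9 \u0152 \uD834\uDD1E"));
            }
        }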

    This converter can be used reliably with all charsets supported by your
    current Java VM. If you have installed Java with the i18n supplementary
    jars, you'll get support for all ISO 8859 charsets, as well as many national
    Asian and Russian standards, lots of DOS/OEM and Windows codepages, and
    legacy 8-bit MacOS charsets.
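
    If you want to check exactly which charsets your own VM supports before
    relying on ''native2ascii'', a quick way (assuming Java 1.4 or later, where
    java.nio.charset is available; the loop syntax needs Java 5, and the class
    name is just an example) is:

        import java.nio.charset.Charset;

        // Print every charset name the running VM can convert from/to,
        // including those added by the optional i18n/charsets jars.
        public class ListCharsets {
            public static void main(String[] args) {
                for (String name : Charset.availableCharsets().keySet()) {
                    System.out.println(name);
                }
            }
        }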

    A typical usage could be for example:

        native2ascii -encoding ISO8859-2 < input-file.ISO8859-2.txt |
        native2ascii -reverse -encoding UTF-8 > output-file.UTF-8.txt

    where:

        * The internal Java representation is not stored but feeds the pipe for
    the second invocation. Note that when the "-reverse" parameter is not used,
    the output will be ASCII only (meaning that all non-ASCII characters of the
    ISO 8859-1 set are also converted into hexadecimal Unicode escape
    sequences).

        * The "-encoding" parameter (with its required value parameter) is
    optional, but its default value is the current platform's native GUI charset
    (for example windows-1252 on Western European localizations of Windows), so
    it generally must be specified.
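
    To see which default ''native2ascii'' would assume on your machine, you
    can query the VM itself (assuming the tool falls back to the same default
    as the "file.encoding" property / Charset.defaultCharset(); the class name
    is only an example):

        import java.nio.charset.Charset;

        // Print the platform default charset that is used when -encoding
        // is omitted (under the assumption stated above).
        public class DefaultCharset {
            public static void main(String[] args) {
                System.out.println(System.getProperty("file.encoding"));
                System.out.println(Charset.defaultCharset()); // Java 5 and later
            }
        }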

    So to convert from UCS-2 or UTF-16 to UTF-8, one would run:

        native2ascii -encoding UTF-16 < input-file.UTF-16.txt |
        native2ascii -reverse -encoding UTF-8 > output-file.UTF-8.txt
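
    If you would rather do this particular conversion directly in a few lines
    of Java instead of piping two ''native2ascii'' invocations, a minimal
    sketch could be the following (the file names are only placeholders; the
    "UTF-16" decoder honours a BOM and assumes big-endian when there is none):

        import java.io.*;

        // Convert a UTF-16 text file to UTF-8 without going through the
        // escaped intermediate representation.
        public class Utf16ToUtf8 {
            public static void main(String[] args) throws IOException {
                Reader in = new InputStreamReader(
                        new FileInputStream("input-file.UTF-16.txt"), "UTF-16");
                Writer out = new OutputStreamWriter(
                        new FileOutputStream("output-file.UTF-8.txt"), "UTF-8");
                char[] buf = new char[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                out.close();
                in.close();
            }
        }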


