From: Jim Monty (jim.monty@yahoo.com)
Date: Thu Nov 04 2010 - 15:52:19 CST
Markus Scherer wrote:
> Doug Ewell wrote:
> > It may be that broken UTF-16 text doesn't appear that often in the
> > real world.
>
> 16-bit Unicode is convenient in that when you find an unpaired surrogate
> (that is, it's not well-formed UTF-16) you can usually just treat it like
> a surrogate code point which normally has default properties much like an
> unassigned code point or noncharacter. It case-maps to itself, normalizes
> to itself, has default Unicode property values (except for the general
> category), etc.
>
> In other words, when you process 16-bit Unicode text it takes no effort to
> handle unpaired surrogates, other than making sure that you only assemble a
> supplementary code point when a lead surrogate is really followed by a trail
> surrogate. Hence little need for cleanup functions -- but if you need one,
> it's trivial to write one for UTF-16.
Thank you! This is what I've always understood about the design of the UTFs:
they're generally quite robust. One errant character doesn't make the whole text
unusable. And in the case of transcoding from, say, UTF-16 to UTF-8, it's
reasonably straightforward to handle anomalies.
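
To make the point above concrete, here is a minimal sketch in Python of the kind of processing Markus describes: a supplementary code point is assembled only when a lead surrogate is really followed by a trail surrogate, and an unpaired surrogate is otherwise passed through as a surrogate code point. The function names (code_points, clean) and the choice of U+FFFD as the substitute in the cleanup step are assumptions made for illustration, not anything taken from an existing library.

```python
REPLACEMENT = 0xFFFD  # assumed substitute for unpaired surrogates


def is_lead(unit):
    """True for a lead (high) surrogate code unit."""
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit):
    """True for a trail (low) surrogate code unit."""
    return 0xDC00 <= unit <= 0xDFFF


def code_points(units):
    """Yield code points from a list of 16-bit code units.

    A pair is combined only when a lead surrogate is immediately
    followed by a trail surrogate; any other surrogate is yielded
    unchanged as a surrogate code point.
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if is_lead(unit) and i + 1 < n and is_trail(units[i + 1]):
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1


def clean(units):
    """Hypothetical cleanup: replace unpaired surrogates with U+FFFD."""
    return [
        REPLACEMENT if 0xD800 <= cp <= 0xDFFF else cp
        for cp in code_points(units)
    ]
```

For example, the code unit db82 from the error message below, followed by an ordinary letter, would come out of code_points() unchanged and out of clean() as U+FFFD, while the text on either side of it is left alone.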
So imagine my dismay when I wrote a trivial Perl script to convert a UTF-16 file
to a UTF-8 file and it died immediately on the first text file I tested it on. I
got this error message:
    UTF-16:Malformed LO surrogate db82 at utf16-to-utf8.pl line 24,
    <$utf16_dat_fh> line 119.
So I checked the documentation
(http://search.cpan.org/dist/Encode/Unicode/Unicode.pm#Error_Checking) and read
this:
    Unlike most encodings which accept various ways to handle errors,
    Unicode encodings simply croaks.
    ...
    Unlike other encodings where mappings are not one-to-one against
    Unicode, UTFs are supposed to map 100% against one another. So
    Encode is more strict on UTFs.
    Consider that "division by zero" of Encode :)
I see nothing to grin about. Division by zero? Seriously? This effectively means
I can't use Perl to transcode Unicode, at least not in the imperfect world *I*
live in.
And GNU iconv is no better. It fails to transcode the same file with an even
more laconic error message:
    iconv: Data.txt: cannot convert
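
For comparison, here is a hedged sketch of what a more forgiving transcoder could look like, using Python's built-in codecs rather than Perl or iconv. It relies only on the standard bytes.decode and str.encode methods with the "replace" error handler, which substitutes U+FFFD for each ill-formed sequence instead of aborting. The function name utf16_to_utf8 and the sample bytes are made up for illustration; the sample assumes little-endian input without a byte order mark.

```python
def utf16_to_utf8(data, encoding="utf-16-le"):
    """Transcode UTF-16 bytes to UTF-8, tolerating ill-formed input.

    Each unpaired surrogate is replaced with U+FFFD rather than
    raising an exception, so one bad code unit does not stop the
    whole conversion.
    """
    text = data.decode(encoding, errors="replace")
    return text.encode("utf-8")


# Hypothetical sample: "A", the unpaired code unit DB82, then "B".
sample = "A".encode("utf-16-le") + b"\x82\xdb" + "B".encode("utf-16-le")

# The default strict handler rejects the same input outright.
try:
    sample.decode("utf-16-le")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

converted = utf16_to_utf8(sample)
```

Whether substituting U+FFFD is acceptable depends on the data; the point is only that the strictness is a policy choice in the converter, not something the encoding form forces.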
I guess I should appeal to the maintainer of the Perl core Encode module to
loosen the shackles a bit, eh?
Thank you all for your very helpful responses.
Jim Monty