From: Martin J. Dürst (duerst@it.aoyama.ac.jp)
Date: Fri Nov 05 2010 - 03:17:25 CST
On 2010/11/05 8:30, Markus Scherer wrote:
> If the conversion libraries you are using do not support this (I don't
> know), then you could ask for such options. Or use conversion libraries that
> do support such options (like ICU and Java).
The encoding conversion library in Ruby 1.9 also supports this. Here's
an example:
>>>>
utf16_borken = "\x00a\x00b\xD8\x00\x00c\x00d".force_encoding('UTF-16BE')
utf8_clean = utf16_borken.encode('UTF-8',
                                 invalid: :replace, replace: '')
puts utf8_clean # prints "abcd"
>>>>
In general, and in particular for Unicode Encoding Forms, it's a bad
idea to just "replace with nothing", because of the security
implications this might have: deleting the offending bytes joins text
that used to be separate, so a string that passed a filter before
conversion may no longer be harmless afterwards. I guess that's the
reason Perl doesn't allow this. But if you are sure there are no
security implications, then there is no reason not to remove lone
surrogates.
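If you are not sure, a more conservative variant is to keep a visible marker where the lone surrogate was. The following is only a sketch: it relies on the default replacement string that Ruby uses when the target is a Unicode encoding (U+FFFD), and the variable names are just made up for illustration.

```ruby
# Same kind of input as above: a lone high surrogate (D800) between "b" and "c".
utf16_broken = "\x00a\x00b\xD8\x00\x00c\x00d".force_encoding('UTF-16BE')

# No replace: option given, so the default replacement is used;
# for a Unicode target such as UTF-8 that is U+FFFD.
utf8_marked = utf16_broken.encode('UTF-8', invalid: :replace)

puts utf8_marked # "ab", then U+FFFD (REPLACEMENT CHARACTER), then "cd"
```

That way the result is still valid UTF-8, but the text on either side of the bad sequence is not silently joined.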
Regards, Martin.
P.S.: Why would you use Ruby for conversion when programming in Perl?
You could just as well program in Ruby, it's much more fun!
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp