From: Martin J. Dürst (firstname.lastname@example.org)
Date: Fri Nov 05 2010 - 03:17:25 CST
On 2010/11/05 8:30, Markus Scherer wrote:
> If the conversion libraries you are using do not support this (I don't
> know), then you could ask for such options. Or use conversion libraries that
> do support such options (like ICU and Java).
The encoding conversion library in Ruby 1.9 also supports this. Here's
an example:
utf16_borken = "\x00a\x00b\xD8\x00\x00c\x00d".force_encoding('UTF-16BE')
utf8_clean = utf16_borken.encode('UTF-8',
                                 invalid: :replace, replace: '')
puts utf8_clean # prints "abcd"
In general, and in particular for Unicode Encoding Forms, it's a bad
idea to just "replace with nothing", because of the security
implications this might have: deleting the ill-formed bytes joins the
text on either side of them, so a string that passed a filter before
conversion may no longer pass it afterwards. I guess that's the reason
Perl doesn't allow this. But if you are sure there are no security
implications, then there is no reason not to remove lone surrogates.
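If you are not sure, here is a rough sketch of the more cautious
alternatives, assuming Ruby 1.9 or later. The variable names stripped,
marked and raised are mine, just for illustration. Leaving out the
replace: option makes Ruby substitute U+FFFD when the target is a
Unicode encoding, and leaving out invalid: altogether makes the
conversion raise an exception instead of repairing anything.

```ruby
# Same ill-formed input as before: a lone high surrogate (D800)
# sits between "ab" and "cd" in UTF-16BE.
utf16_borken = "\x00a\x00b\xD8\x00\x00c\x00d".force_encoding('UTF-16BE')

# Replacing with nothing silently joins the neighbouring characters.
stripped = utf16_borken.encode('UTF-8', invalid: :replace, replace: '')

# With no replace: option, the default replacement for a Unicode
# target encoding is U+FFFD, so the damage stays visible.
marked = utf16_borken.encode('UTF-8', invalid: :replace)

# With no invalid: option at all, the conversion refuses the input.
begin
  utf16_borken.encode('UTF-8')
  raised = false
rescue Encoding::InvalidByteSequenceError
  raised = true
end

puts stripped
puts marked.inspect
puts raised
```

Which of the three you want depends on where the converted text goes
next. If it is compared, filtered or used as an identifier, the second
or third variant is probably the safer choice.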
P.S.: Why would you use Ruby for conversion when programming in Perl?
You could just as well program in Ruby, it's much more fun!
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:email@example.com