Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Martin J. Dürst (duerst@it.aoyama.ac.jp)
Date: Fri Nov 05 2010 - 03:17:25 CST

    On 2010/11/05 8:30, Markus Scherer wrote:

    > If the conversion libraries you are using do not support this (I don't
    > know), then you could ask for such options. Or use conversion libraries that
    > do support such options (like ICU and Java).

    The encoding conversion library in Ruby 1.9 also supports this. Here's
    an example:

    >>>>
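    # "a", "b", a lone high surrogate (D800), "c", "d" as UTF-16BE bytes
    utf16_broken = "\x00a\x00b\xD8\x00\x00c\x00d".force_encoding('UTF-16BE')
    # invalid: :replace with an empty replacement drops the lone surrogate
    utf8_clean = utf16_broken.encode('UTF-8',
                                      invalid: :replace, replace: '')
    puts utf8_clean # prints "abcd"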
    >>>>

    In general, and in particular for Unicode Encoding Forms, it's a bad
    idea to just "replace with nothing", because of the security
    implications this might have. I guess that's the reason Perl doesn't
    allow this. But if you are sure there are no security implications, then
    there is no reason not to remove lone surrogates.
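
    For cases where the security implications are not clear, a more
    conservative variant is to substitute U+FFFD REPLACEMENT CHARACTER rather
    than nothing, so the damaged spot stays visible and can be counted. The
    following is only a sketch: the variable names are made up for
    illustration, and the input is built with Array#pack rather than a string
    literal.

```ruby
# Build the malformed input from 16-bit code units (big-endian);
# 0xD800 is a lone high surrogate with no low surrogate after it.
utf16_broken = [0x0061, 0x0062, 0xD800, 0x0063, 0x0064]
               .pack('n*').force_encoding('UTF-16BE')

# Replace the ill-formed sequence with U+FFFD instead of deleting it.
utf8_marked = utf16_broken.encode('UTF-8',
                                  invalid: :replace, replace: "\uFFFD")

# Count the replacement characters so the damage can be reported.
# Note: this would also count any U+FFFD already present in the input.
damaged = utf8_marked.count("\uFFFD")
puts "#{damaged} ill-formed sequence(s) replaced"
```

    Leaving out the replace: option should behave the same way here, since
    U+FFFD is the default replacement when the destination is a Unicode
    encoding.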

    Regards, Martin.

    P.S.: Why would you use Ruby for conversion when programming in Perl?
    You could just as well program in Ruby, it's much more fun!

    -- 
    #-# Martin J. Dürst, Professor, Aoyama Gakuin University
    #-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
    

