Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Martin J. Dürst (duerst@it.aoyama.ac.jp)
Date: Fri Nov 05 2010 - 03:17:25 CST

Next message: Doug Ewell: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

Previous message: Martin J. Dürst: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
In reply to: Markus Scherer: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
Next in thread: Martin J. Dürst: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

On 2010/11/05 8:30, Markus Scherer wrote:

> If the conversion libraries you are using do not support this (I don't
> know), then you could ask for such options. Or use conversion libraries that
> do support such options (like ICU and Java).

The encoding conversion library in Ruby 1.9 also supports this. Here's
an example:

>>>>
utf16_borken = "\x00a\x00b\xD8\x00\x00c\x00d".force_encoding('UTF-16BE')
utf8_clean = utf16_borken.encode('UTF-8',
                                 invalid: :replace, replace: '')
puts utf8_clean # prints "abcd"
>>>>
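
For reporting rather than repairing, here is a minimal sketch of what
happens when the invalid: option is left out. It assumes the same input
as above, rebuilt from 16-bit code units with pack; the variable names
units and report are only illustrative. Without the option, encode
raises Encoding::InvalidByteSequenceError, and the exception carries
the offending bytes.

```ruby
# Same broken UTF-16BE input as above, built from its 16-bit code units.
# 0xD800 is a high surrogate with no low surrogate following it.
units = [0x0061, 0x0062, 0xD800, 0x0063, 0x0064]
utf16_broken = units.pack('n*').force_encoding('UTF-16BE')

report = nil
begin
  # No invalid: option is given, so the lone surrogate raises an exception.
  utf16_broken.encode('UTF-8')
rescue Encoding::InvalidByteSequenceError => e
  # error_bytes holds the bytes the converter rejected; show them as hex.
  report = e.error_bytes.unpack('C*').map { |b| format('%02X', b) }.join(' ')
end

puts "invalid bytes: #{report}"
```

This only reports the first problem it meets, since the exception
aborts the conversion; a real utility would need to keep scanning.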

In general, and in particular for Unicode Encoding Forms, it's a bad
idea to just "replace with nothing", because of the security
implications this might have: deleting the offending code units can
join text on either side into a string that an earlier check never saw.
I guess that's the reason Perl doesn't allow this. But if you are sure
there are no security implications, then there is no reason not to
remove lone surrogates.
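
When removal is not clearly safe, a more cautious variant is to
substitute U+FFFD REPLACEMENT CHARACTER, so the damage stays visible in
the output. The following is a sketch under the same assumptions as the
example above; utf8_marked is just an illustrative name.

```ruby
# Broken UTF-16BE input: "ab", a lone high surrogate (0xD800), then "cd".
units = [0x0061, 0x0062, 0xD800, 0x0063, 0x0064]
utf16_broken = units.pack('n*').force_encoding('UTF-16BE')

# Replace the lone surrogate with U+FFFD instead of deleting it.
utf8_marked = utf16_broken.encode('UTF-8',
                                  invalid: :replace, replace: "\uFFFD")

puts utf8_marked.inspect
```

The result is valid UTF-8, and the position of the lone surrogate
remains marked between "ab" and "cd" instead of silently disappearing.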

Regards, Martin.

P.S.: Why would you use Ruby for conversion when programming in Perl?
You could just as well program in Ruby; it's much more fun!

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp

This archive was generated by hypermail 2.1.5 : Fri Nov 05 2010 - 03:19:38 CST