Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Jim Monty (jim.monty@yahoo.com)
Date: Wed Nov 03 2010 - 17:20:44 CST

Next message: Bjoern Hoehrmann: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

Previous message: Doug Ewell: "RE: Utility to report and repair broken surrogate pairs in UTF-16 text"
In reply to: Doug Ewell: "RE: Utility to report and repair broken surrogate pairs in UTF-16 text"
Next in thread: Bjoern Hoehrmann: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
Reply: Bjoern Hoehrmann: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

Doug Ewell wrote:
> Jim Monty <jim dot monty at yahoo dot com> wrote:
> > Is there a utility, preferably open source and written in C, that inspects
> > UTF-16/UTF-16BE/UTF-16LE text and identifies broken surrogate pairs and
> > illegal characters? Ideally, the utility can both report illegal code units
> > and "repair" them by replacing them with U+FFFD.
>
> What's an "illegal" character, for purposes of this exercise? Do you
> mean a noncharacter, or something else?

I mean the sixty-six code points that Unicode reserves as noncharacters (e.g.,
U+FFFE).

http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters
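
For illustration, here is a minimal sketch in C of a test for those sixty-six
code points. The function name is_noncharacter is hypothetical, and the sketch
assumes the input has already been decoded from UTF-16 into a scalar code point
value. The sixty-six are U+FDD0 through U+FDEF (32 code points) plus the last
two code points of each of the 17 planes (34 code points).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if cp is one of the 66 Unicode
 * noncharacters. Assumes cp is an already-decoded code point value. */
bool is_noncharacter(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false;                 /* outside the Unicode code space */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;                  /* the contiguous block of 32 */
    /* U+nFFFE and U+nFFFF for each plane n = 0x00..0x10 */
    return (cp & 0xFFFE) == 0xFFFE;
}
```

Whether such code points should be replaced or merely reported is a policy
decision; this sketch only identifies them.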

But I'm most keenly interested in a utility that detects broken UTF-16 surrogate
pairs. By this, I mean a 16-bit code unit in the range from 0xD800 through 0xDBFF
(a high surrogate) not immediately followed by a 16-bit code unit in the range
from 0xDC00 through 0xDFFF (a low surrogate), or a low surrogate not immediately
preceded by a high surrogate.

http://en.wikipedia.org/wiki/UTF-16/UCS-2#Encoding_of_characters_outside_the_BMP
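
The detection and repair described above could be sketched in C roughly as
follows. This is not an existing utility: the names repair_surrogates,
is_high_surrogate and is_low_surrogate are hypothetical, and the sketch assumes
the text has already been read into an array of 16-bit code units in host byte
order (so any UTF-16BE/UTF-16LE byte swapping and BOM handling happens
beforehand).

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu

static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Replaces every unpaired surrogate in buf[0..len) with U+FFFD, in place.
 * Returns the number of code units that were replaced. */
size_t repair_surrogates(uint16_t *buf, size_t len)
{
    size_t repaired = 0;
    size_t i = 0;

    while (i < len) {
        if (is_high_surrogate(buf[i])) {
            if (i + 1 < len && is_low_surrogate(buf[i + 1])) {
                i += 2;               /* well-formed pair: skip both units */
                continue;
            }
            buf[i] = REPLACEMENT_CHAR; /* high surrogate with no low after it */
            repaired++;
        } else if (is_low_surrogate(buf[i])) {
            buf[i] = REPLACEMENT_CHAR; /* low surrogate with no high before it */
            repaired++;
        }
        i++;
    }
    return repaired;
}
```

Because each broken code unit is replaced by exactly one code unit, the repair
can be done in place and the length of the text does not change. A report-only
mode could use the same loop and record the offset i instead of overwriting
buf[i].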

I need to repair broken UTF-16 text that some software (e.g., GNU iconv) and
programming languages (e.g., Perl) choke on. See this discussion of the topic on
PerlMonks.

http://www.perlmonks.org/?node_id=719833

Jim Monty


This archive was generated by hypermail 2.1.5 : Wed Nov 03 2010 - 17:42:05 CST