Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Bjoern Hoehrmann (derhoermi@gmx.net)
Date: Wed Nov 03 2010 - 18:33:19 CST

  • Next message: Jim Monty: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

    * Jim Monty wrote:
    >I mean the sixty-six code point Unicode reserves as noncharacters (e.g.,
    >U+FFFE).
    >
    >    http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters
    >
    >But I'm most keenly interested in a utility that detects broken UTF-16 surrogate
    >pairs. By this, I mean a 16-bit code unit in the range from 0xD800 thru 0xDBFF
    >not immediately followed by a 16-bit code unit in the range from 0xDC00 thru
    >0xDFFF, and vice versa.

    The simple solution to that is a small state machine that you put each
    byte through, http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ does that
    for UTF-8. UTF-16 is quite a bit simpler, so to use that approach you
    can make a map between bytes and character classes (to save space):

      my @byte2class = (
        (map { 0 } 0x00 .. 0xD7),
        (map { 1 } 0xD8 .. 0xDB),
        (map { 2 } 0xDC .. 0xDF),
        (map { 0 } 0xE0 .. 0xFF),
      );

    and then you create a transition table

      $d[0][0] = 1; # - seen a normal lead byte
      $d[0][1] = 2; # - seen a high surrogate lead
      $d[0][2] = ?; # - unexpected low surrogate
      
      $d[1][0] = 0; # after a normal lead byte
      $d[1][1] = 0; # anything goes, so this
      $d[1][2] = 0; # returns to the start state
      
      $d[2][0] = 3; # this is the second byte
      $d[2][1] = 3; # of a high surrogate which
      $d[2][2] = 3; # can also be anything
      
      $d[3][0] = ?; # - bad normal lead byte
      $d[3][1] = ?; # - bad high surrogate lead
      $d[3][2] = 4; # - what we are looking for
      
      $d[4][0] = 0; # this is the second byte
      $d[4][1] = 0; # of a low surrogate; when
      $d[4][2] = 0; # it's read, loop to start

    You did not say how you want to resynchronize the stream, so I left some
    of the entries blank (you could for instance make a state 5, check for
    that state, emit a replacement character in place of the byte you read,
    and then let the automaton go from state 5 back to state 0, as one error
    recovery strategy). You'd then simulate the state machine like so:

      while (...) {
        my $byte = ...;
        $state = $d[ $state ][ $byte2class[ $byte ] ];
        # error handling would go here
      }

    You could handle other undesired sequences in the same way, but that'd
    make the automaton a bit hard to write manually. Also, you would have to
    make two separate automata, one for each of the variants (though you can
    of course use the same table and just set the start state accordingly).

    -- 
    Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
    Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
    25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
    


    This archive was generated by hypermail 2.1.5 : Wed Nov 03 2010 - 18:36:48 CST