RE: Other Question, Problem, or Feedback

From: Dean Harding (dean.harding@dload.com.au)
Date: Mon Jun 12 2006 - 21:49:23 CDT


    > > 1.Is it true that there are many ways of encoding the same character in
    > > UTF-16?
    >
    > No. There is exactly one way of encoding each character in UTF-16. See
    > TUS 4.0 Section 2.5 'Encoding Forms', especially p29.

    I think this may be referring to the Unicode normalization forms: the same
    text can be represented by different, canonically equivalent sequences of
    code points. For example, "e with an acute accent" could be <U+00E9> or it
    could be <U+0065, U+0301>.
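
    To make that concrete, here is a small Python sketch (my choice of
    language; the helper name utf16_units is made up for illustration). It
    prints the UTF-16 code units of both spellings and shows that they only
    compare equal after normalization:

```python
import unicodedata

precomposed = "\u00e9"   # <U+00E9>
decomposed = "e\u0301"   # <U+0065, U+0301>

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers.

    Hypothetical helper for illustration only.
    """
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# Each sequence has exactly one UTF-16 encoding, but the two sequences differ.
print([hex(u) for u in utf16_units(precomposed)])
print([hex(u) for u in utf16_units(decomposed)])

# A plain comparison sees two different strings.
print(precomposed == decomposed)

# After normalizing to the same form (NFC here), they compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

    So both answers are right: each character has one UTF-16 encoding, but
    the same visible text can still be spelled with different characters.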

    This CAN be a problem for regular expressions, unless they're designed with
    this in mind. The simplest solution is to normalize the pattern and the
    input strings to the same form before doing matching (for example, .NET
    provides the String.Normalize
    [http://msdn2.microsoft.com/en-us/ebza6ck1.aspx] method).
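
    A rough sketch of the normalize-before-matching idea, in Python rather
    than .NET (unicodedata.normalize plays the role of String.Normalize
    here; the wrapper normalized_search is a made-up name, not a library
    function, and NFC is just one reasonable choice of form):

```python
import re
import unicodedata

def normalized_search(pattern, text, form="NFC"):
    """Normalize pattern and text to the same form, then search.

    Hypothetical helper: assumes the pattern contains the accented
    characters literally, so normalizing it is safe.
    """
    pattern = unicodedata.normalize(form, pattern)
    text = unicodedata.normalize(form, text)
    return re.search(pattern, text)

text = "cafe\u0301"     # decomposed: e followed by U+0301
pattern = "caf\u00e9"   # precomposed: U+00E9

# Without normalization the precomposed pattern does not match.
print(re.search(pattern, text))

# With both sides normalized to NFC, the match succeeds.
print(normalized_search(pattern, text) is not None)
```

    This only handles canonical equivalence by preprocessing; a regex
    engine that is itself aware of canonical equivalence would not need
    the extra step.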

    Dean.


