Re: Unicode Regular Expressions, Surrogate Points and UTF-8

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sun, 1 Jun 2014 18:04:57 +0100

On Sun, 1 Jun 2014 08:58:26 -0700
Markus Scherer <markus.icu_at_gmail.com> wrote:

> You misunderstand. In Java, \uD808\uDF45 is the only way to escape a
> supplementary code point, but as long as you have a surrogate pair,
> it is treated as a code point in APIs that support them.

Wasn't it obvious that, in the following paragraph, \uD808\uDF45 was a
pattern?

"Bear in mind that a pattern \uD808 shall not match anything in a
well-formed Unicode string. \uD808\uDF45 specifies a sequence of two
codepoints. This sequence can occur in an ill-formed UTF-32 Unicode
string and before Unicode 5.2 could readily be taken to occur in an
ill-formed UTF-8 Unicode string. RL1.7 declares that for a regular
expression engine, the codepoint sequence <U+D808, U+DF45> cannot
occur in a UTF-16 Unicode string; instead, the code unit sequence <D808
DF45> is the codepoint sequence <U+12345 CUNEIFORM SIGN URU TIMES
KI>."

(It might have been clearer to you if I'd said '8-bit' and '16-bit'
instead of UTF-8 and UTF-16. It does make me wonder what you'd call a
16-bit encoding of arbitrary *codepoint* sequences.)
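
For anyone wanting to check the Java behaviour Markus describes, here is a
minimal sketch, assuming Java 7 or later and java.util.regex. The class and
method names are made up for illustration. It shows that the two code units
<D808 DF45> count as one code point in the code-point-aware String APIs, and
that the regex escape \uD808\uDF45 matches that single code point. The
lone-surrogate case is printed rather than asserted, since RL1.7 only says
what a conforming engine should do with it.

```java
import java.util.regex.Pattern;

public class SurrogatePairDemo {
    // U+12345 CUNEIFORM SIGN URU TIMES KI, stored as the UTF-16
    // code unit sequence <D808 DF45>.
    static final String SIGN = "\uD808\uDF45";

    // The doubled backslashes hand the escapes to the regex engine
    // itself rather than to the Java compiler.
    static final Pattern PAIR = Pattern.compile("\\uD808\\uDF45");

    // A pattern consisting of the high surrogate escape on its own.
    static final Pattern LONE_HIGH = Pattern.compile("\\uD808");

    static int codeUnits() {
        return SIGN.length();
    }

    static int codePoints() {
        return SIGN.codePointCount(0, SIGN.length());
    }

    static boolean pairMatchesSign() {
        return PAIR.matcher(SIGN).matches();
    }

    static boolean loneHighFoundInSign() {
        return LONE_HIGH.matcher(SIGN).find();
    }

    public static void main(String[] args) {
        System.out.println(codeUnits());                               // 2
        System.out.println(codePoints());                              // 1
        System.out.println(Integer.toHexString(SIGN.codePointAt(0)));  // 12345
        System.out.println(pairMatchesSign());                         // true
        // Under RL1.7 a lone surrogate pattern should not match inside
        // a well-formed pair; print what this engine actually does.
        System.out.println(loneHighFoundInSign());
    }
}
```

The String literal and the pattern arrive by different routes: the compiler
turns the literal's escapes into two code units, while the regex engine sees
the escapes as text and combines the surrogate pair itself.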

Richard.
Received on Sun Jun 01 2014 - 12:06:12 CDT