Re: Unicode Regular Expressions, Surrogate Points and UTF-8

From: Xueming Shen <xueming.shen_at_oracle.com>
Date: Tue, 03 Jun 2014 15:06:30 -0700

On 06/02/2014 01:01 PM, Richard Wordingham wrote:
> On Mon, 2 Jun 2014 11:29:09 +0200
> Mark Davis ☕️<mark_at_macchiato.com> wrote:
>
>>> \uD808\uDF45 specifies a sequence of two codepoints.
>> That is simply incorrect.
> The above is in the sample notation of UTS #18 Version 17 Section 1.1.
>
> From what I can make out, the corresponding Java notation would be
> \x{D808}\x{DF45}. I don't *know* what \x{D808} and \x{DF45} match in
> Java, or whether they are even acceptable. The only thing UTS #18
> RL1.7 permits them to match in Java is lone surrogates, but I don't
> know if Java complies.

The notation "\uD808\uDF45" is interpreted as a single supplementary codepoint and
is represented internally as a pair of surrogates in the String.

   Pattern.compile("\\x{D808}\\x{DF45}").matcher("\ud808\udf45").find();  // -> false
   Pattern.compile("\uD808\uDF45").matcher("\ud808\udf45").find();        // -> true
   Pattern.compile("\\x{D808}").matcher("\ud808\udf45").find();           // -> false
   Pattern.compile("\\x{D808}").matcher("\ud808_\udf45").find();          // -> true
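
For readers who want to see why the pair behaves as one unit, here is a minimal sketch of how a Java String stores it. It assumes the pair D808 DF45 encodes U+12345, which follows from the UTF-16 arithmetic, and that \x{h...h} is available in the pattern syntax (Java 7 or later). The class and method names are hypothetical, chosen only for illustration.

```java
import java.util.regex.Pattern;

public class SurrogateDemo {
    // One supplementary code point, written as its two UTF-16 surrogates.
    static final String PAIR = "\uD808\uDF45";

    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits() {
        return PAIR.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints() {
        return PAIR.codePointCount(0, PAIR.length());
    }

    // The code point that starts at index 0 (the pair is combined).
    static int firstCodePoint() {
        return PAIR.codePointAt(0);
    }

    // Does a pattern naming the code point directly match the whole string?
    static boolean matchesAsCodePoint() {
        return Pattern.compile("\\x{12345}").matcher(PAIR).matches();
    }

    public static void main(String[] args) {
        System.out.println("code units:  " + codeUnits());
        System.out.println("code points: " + codePoints());
        System.out.println("first:       U+" + Integer.toHexString(firstCodePoint()).toUpperCase());
        System.out.println("\\x{12345} matches: " + matchesAsCodePoint());
    }
}
```

The string holds two chars but only one code point, which is consistent with the results above: a pattern for the lone surrogate \x{D808} finds nothing inside a well-formed pair, but does find it once the two surrogates are separated.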

-Sherman

> All UTS #18 says for sure about regular expressions matching code units
> is that they don't satisfy RL1.1, though Section 1.7 appears to ban
> them when it says, "A fundamental requirement is that Unicode text be
> interpreted semantically by code point, not code units". Perhaps it's
> a fundamental requirement of something other than UTS #18. I thought
> matching parts of characters in terms of their canonical equivalences
> was awkward enough, without having the additional option of matching
> some of the code units!
>

_______________________________________________
Unicode mailing list
Unicode_at_unicode.org
http://unicode.org/mailman/listinfo/unicode
Received on Tue Jun 03 2014 - 18:13:54 CDT