Re: Unicode Regular Expressions, Surrogate Points and UTF-8

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Wed, 4 Jun 2014 00:40:50 +0100

On Tue, 03 Jun 2014 15:06:30 -0700
Xueming Shen <xueming.shen_at_oracle.com> wrote:

> On 06/02/2014 01:01 PM, Richard Wordingham wrote:
> > On Mon, 2 Jun 2014 11:29:09 +0200
> > Mark Davis ☕️<mark_at_macchiato.com> wrote:
> >
> >>> \uD808\uDF45 specifies a sequence of two codepoints.
> >> That is simply incorrect.
> > The above is in the sample notation of UTS #18 Version 17 Section
> > 1.1.
> >
> > From what I can make out, the corresponding Java notation would be
> > \x{D808}\x{DF45}. I don't *know* what \x{D808} and \x{DF45} match
> > in Java, or whether they are even acceptable. The only thing UTS
> > #18 RL1.7 permits them to match in Java is lone surrogates, but I
> > don't know if Java complies.
>
> The notation "\uD808\uDF45" is interpreted as a supplementary
> codepoint and is represented internally as a pair of surrogates in
> a String.
>
> Pattern.compile("\\x{D808}\\x{DF45}").matcher("\ud808\udf45").find();
> -> false
> Pattern.compile("\uD808\uDF45").matcher("\ud808\udf45").find();
> -> true
> Pattern.compile("\\x{D808}").matcher("\ud808\udf45").find();
> -> false
> Pattern.compile("\\x{D808}").matcher("\ud808_\udf45").find();
> -> true

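Thank you for providing examples confirming that what in the UTS #18
*sample* notation would be written \uD808\uDF45, i.e. \x{D808}\x{DF45}
in Java notation, matches nothing in any 16-bit Unicode string. In such
a string, a high surrogate immediately followed by a low surrogate
always forms a surrogate pair, so two adjacent lone surrogates can
never occur.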

Richard.
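
The four checks quoted above can be collected into one runnable class.
This is only a sketch: the class name SurrogateRegexDemo, the helper
found() and the constants PAIR and SPLIT are names made up for
illustration, and the results noted in the comments are the ones
reported in the quoted message, not a statement about every Java
release.

```java
import java.util.regex.Pattern;

public class SurrogateRegexDemo {
    // U+12345 as stored in a Java String: high surrogate D808, low surrogate DF45.
    static final String PAIR = "\uD808\uDF45";
    // The same two code units separated by '_', so each one is a lone surrogate.
    static final String SPLIT = "\uD808_\uDF45";

    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Two lone-surrogate escapes against the pair (reported above as false).
        System.out.println(found("\\x{D808}\\x{DF45}", PAIR));
        // The pair written literally in the pattern (reported above as true).
        System.out.println(found(PAIR, PAIR));
        // One lone-surrogate escape against the pair (reported above as false).
        System.out.println(found("\\x{D808}", PAIR));
        // One lone-surrogate escape against a genuinely lone surrogate (reported above as true).
        System.out.println(found("\\x{D808}", SPLIT));
        // The pair is two chars but a single code point.
        System.out.println(PAIR.length() + " chars, "
                + PAIR.codePointCount(0, PAIR.length()) + " code point");
    }
}
```

The last line is there to show why the first check fails: the matcher
walks the input by code point, and the input contains one code point,
not two.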

Received on Tue Jun 03 2014 - 18:41:52 CDT