Re: Unicode Regular Expressions, Surrogate Points and UTF-8

From: Philippe Verdy <>
Date: Sat, 31 May 2014 13:21:23 +0200

I think Richard did not speak about that, but about the behavior of a
matcher that would start parsing a text using a wrongly guessed encoding.
He gave the example of a valid CESU-8 text containing U+10000: when
reading it incorrectly as UTF-8, the parser gets the 4 invalid sequences.
CESU-8 cannot be easily detected at the start of the stream from the
encoding of the byte order mark U+FEFF, because U+FEFF is encoded with the
same bytes in CESU-8 and in UTF-8.
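
To make the byte-level difference concrete, here is a minimal Python sketch.
The helper name cesu8_encode is hypothetical (Python ships no CESU-8 codec);
it assumes CESU-8 is produced by encoding each UTF-16 code unit separately
with the 3-byte UTF-8 pattern, and it relies on the standard "surrogatepass"
error handler to do so.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode text as CESU-8.

    Supplementary code points are first split into a UTF-16 surrogate
    pair, then each surrogate is encoded as its own 3-byte sequence.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)    # lead surrogate
            low = 0xDC00 | (v & 0x3FF)   # trail surrogate
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            # BMP code points have the same bytes in UTF-8 and CESU-8.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


utf8_bytes = "\U00010000".encode("utf-8")   # F0 90 80 80
cesu8_bytes = cesu8_encode("\U00010000")    # ED A0 80 ED B0 80

# A strict UTF-8 decoder rejects the CESU-8 bytes for U+10000.
try:
    cesu8_bytes.decode("utf-8")
    strict_utf8_accepts = True
except UnicodeDecodeError:
    strict_utf8_accepts = False

# U+FEFF is in the BMP, so its encoding cannot tell the two apart.
bom_is_ambiguous = cesu8_encode("\ufeff") == "\ufeff".encode("utf-8")
```

With these assumptions, strict_utf8_accepts ends up False and
bom_is_ambiguous ends up True, which is the situation described above.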

However, CESU-8 can be detected by the initial encoding of another byte
order mark, U+1FFFE (a noncharacter that MUST be stripped from the parsed
stream of code points once detected). Documents starting with this
noncharacter are supposed to be non-interoperable by definition, even
though the presence of that special byte order mark would be a very safe
way to identify CESU-8 and discriminate it from UTF-8.
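
As a rough illustration of that idea, the Python sketch below checks the
first bytes of a stream for the two possible encodings of U+1FFFE. The
function name sniff and its return labels are hypothetical, and using
U+1FFFE as a signature is only the suggestion made in this message, not a
standardized practice.

```python
# U+FEFF: EF BB BF in both UTF-8 and CESU-8, so it cannot discriminate.
BOM_FEFF = "\ufeff".encode("utf-8")

# U+1FFFE in UTF-8: one 4-byte sequence, F0 9F BF BE.
UTF8_1FFFE = "\U0001FFFE".encode("utf-8")

# U+1FFFE in CESU-8: surrogates D83F DFFE, each encoded on 3 bytes,
# giving ED A0 BF ED BF BE.
CESU8_1FFFE = "\ud83f\udffe".encode("utf-8", "surrogatepass")


def sniff(data: bytes):
    """Hypothetical sniffer: return (label, payload without the marker)."""
    if data.startswith(CESU8_1FFFE):
        return "cesu-8", data[len(CESU8_1FFFE):]
    if data.startswith(UTF8_1FFFE):
        return "utf-8", data[len(UTF8_1FFFE):]
    if data.startswith(BOM_FEFF):
        # Same bytes in both encodings: the marker is stripped, but the
        # encoding remains undetermined.
        return "utf-8-or-cesu-8", data[len(BOM_FEFF):]
    return "unknown", data
```

In this sketch the marker is removed from the returned payload, matching the
requirement that the noncharacter be stripped once detected.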

2014-05-31 1:15 GMT+02:00 Markus Scherer <>:

> If you use Unicode 16-bit strings, it's easy to "pass through" unpaired
> surrogates and treat them like code points; it's often not productive or
> necessary to check for them all the time, that is, to be strict about
> UTF-16.
> On the other hand, I don't think anyone expects you to support invalid
> UTF-8, and especially not to support any and all Unicode 8-bit strings (see
> Unicode 3.9 Unicode Encoding Forms for what I mean here).
> If you find UTS #18 unclear or misleading, I suggest you submit feedback
> pointing out specific text issues.
> markus

Received on Sat May 31 2014 - 06:22:46 CDT
