Re: Unicode Regular Expressions, Surrogate Points and UTF-8

From: Richard Wordingham
Date: Sat, 31 May 2014 13:11:03 +0100

On Sat, 31 May 2014 13:21:23 +0200
Philippe Verdy wrote:

> However CESU-8 can be detected by the initial encoding of another byte
> order mark U+1FFFE (which is a non-character that MUST be stripped
> once detected from the parsed stream of code points). However,
> documents starting with this non-character are supposed to be
> non-interoperable by definition even though the presence of that
> special byte order mark would be very safe to secure CESU-8 and
> discriminate it from UTF-8.

Where is this tagging defined?

It is in general not true that non-characters must be stripped on
input. That would be highly inappropriate in a conversion program that
transformed between UTFs. Also, the collations defined in CLDR Version
23 file collation/zh.xml would be severely damaged if the
non-characters were stripped out. In version 24 and later the file
uses a different syntax and doesn't contain non-characters.
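
To illustrate the point about conversion between UTFs, here is a minimal
Python sketch, not taken from any particular converter. The helper names
is_noncharacter and transcode_utf8_to_utf16le are made up for this
example, and it assumes Python's built-in codecs, which pass
non-characters through unchanged. A transcoder that stripped U+1FFFE
would fail the round trip.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 Unicode non-characters:
    U+FDD0..U+FDEF plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def transcode_utf8_to_utf16le(data: bytes) -> bytes:
    """Hypothetical converter: re-encode UTF-8 as UTF-16LE, dropping nothing."""
    return data.decode("utf-8").encode("utf-16-le")


# U+1FFFE is the non-character mentioned in the quoted message.
text = "\U0001FFFE" + "abc"

utf8 = text.encode("utf-8")
utf16 = transcode_utf8_to_utf16le(utf8)
round_trip = utf16.decode("utf-16-le")

# The non-character survives the conversion in both directions.
print(round_trip == text)
print(is_noncharacter(ord(round_trip[0])))
```

Both print statements report True: the non-character is ordinary data as
far as the conversion is concerned.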

Unicode mailing list
Received on Sat May 31 2014 - 07:13:11 CDT
