Re: Best practices for replacing UTF-8 overlongs

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Mon, 19 Dec 2016 16:17:29 -0800

On Mon, Dec 19, 2016 at 3:04 PM, Karl Williamson <public_at_khwilliamson.com>
wrote:

> It seems counterintuitive to me that the two-byte sequence C0 80 should be
> replaced by two replacement characters under best practices, or that
> E0 80 80 should also be replaced by two. Each sequence was legal in early
> Unicode versions, so it seems best to treat each as a single sequence and
> replace it with a single replacement character.
>

Looks like the ICU converters and string-iteration macros do what you
expect (if I understand your expectations).
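
For illustration only, here is a minimal Python sketch of the
one-replacement-per-sequence policy described in the quoted text. It is not
ICU code and does not claim to match ICU's exact behavior; the function name
decode_structural and its error rules (lead byte decides the length, any
ill-formed result collapses to one U+FFFD) are assumptions made for this
example.

```python
REPLACEMENT = "\ufffd"


def decode_structural(data: bytes) -> str:
    """Decode UTF-8, emitting ONE U+FFFD per structurally delimited sequence.

    Hypothetical policy for illustration: the lead byte determines how many
    continuation bytes belong to the sequence. If the sequence is overlong,
    truncated, a surrogate, or above U+10FFFF, the whole run consumed so far
    is replaced by a single U+FFFD.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # ASCII byte: always valid on its own.
            out.append(chr(b))
            i += 1
            continue
        if 0xC0 <= b <= 0xDF:
            need, cp, minimum = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            need, cp, minimum = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF7:
            need, cp, minimum = 3, b & 0x07, 0x10000
        else:
            # Stray continuation byte (80..BF) or F8..FF: one U+FFFD per byte.
            out.append(REPLACEMENT)
            i += 1
            continue
        # Consume up to `need` continuation bytes (80..BF).
        j = i + 1
        while j < n and j - i <= need and 0x80 <= data[j] <= 0xBF:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        complete = (j - i - 1) == need
        valid = (
            complete
            and minimum <= cp <= 0x10FFFF      # rejects overlongs and > U+10FFFF
            and not (0xD800 <= cp <= 0xDFFF)   # rejects surrogates
        )
        out.append(chr(cp) if valid else REPLACEMENT)
        i = j
    return "".join(out)
```

With this sketch, both C0 80 and E0 80 80 decode to a single U+FFFD each,
while well-formed input round-trips unchanged. A decoder that instead
replaces byte by byte would produce one U+FFFD for every byte of those
sequences.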

markus
Received on Mon Dec 19 2016 - 18:17:51 CST
