Re: Questions about UAX #29 from Karl Williamson on 2011-07-05 (Unicode Mail List Archive)

From: Karl Williamson <public_at_khwilliamson.com>
Date: Tue, 05 Jul 2011 17:31:41 -0600

On 07/05/2011 09:29 AM, Mark Davis ☕ wrote:
> Ah, you're right; I wasn't looking carefully enough at what you wrote.
>
> Yes, an unassigned code point (Cn) is treated as a base character.
>
> Unassigned code points are peculiar beasts, since we don't know really
> how they should behave until (and if) they are assigned. Their treatment
> by the Unicode algorithms varies based on some factors:
>
> * safety - don't have them behave in a way that causes problems
> * foresight - have them behave like the most likely candidate for
> future assignment
> * simplicity - since they shouldn't occur normally in text, don't
> spend too much time worrying about them.
>
> These are not formalized principles, just my observations on how we've
> operated over the years.
>
> Mark
> /— Il meglio è l’inimico del bene —/

Thanks for the answer. It does seem weird to me to treat them as base
characters.

But I'm wondering, then, about Cs (isolated surrogates). They are also
treated as base characters, which seems wrong to me. Since UTS #18 is
starting to mention the possibility of them in regexes, perhaps this
should be addressed?

Also, my understanding of UAX #44 is that private use code points may or
may not be treated as base characters, at the application's discretion.
But this isn't mentioned in UAX #29.
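
For anyone who wants to check which code points fall into the three
categories discussed here (Cn, Cs, Co), the following is a minimal Python
sketch using the standard unicodedata module. The helper names
general_category and describe, and the LABELS table, are made up for
illustration. The sketch only reports General_Category values; it does not
implement the UAX #29 segmentation rules, and the results depend on the
Unicode version shipped with the Python build.

```python
import unicodedata

# Hypothetical labels for the three General_Category values in this thread.
LABELS = {
    "Cn": "unassigned or noncharacter",
    "Cs": "surrogate",
    "Co": "private use",
}


def general_category(cp: int) -> str:
    """Return the two-letter General_Category of the code point cp."""
    return unicodedata.category(chr(cp))


def describe(cp: int) -> str:
    """Format a code point with its category and, for Cn/Cs/Co, a label."""
    gc = general_category(cp)
    return f"U+{cp:04X} {gc} {LABELS.get(gc, 'other')}"


if __name__ == "__main__":
    # U+FFFF is a noncharacter (always Cn), U+D800 is a lone surrogate,
    # U+E000 is private use, U+0301 is a combining mark for comparison.
    for cp in (0xFFFF, 0xD800, 0xE000, 0x0301):
        print(describe(cp))
```

Note that the code prints only the hexadecimal code point, never the
character itself, since a lone surrogate cannot be encoded as UTF-8 on
output.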
Received on Tue Jul 05 2011 - 18:34:51 CDT