Re: Questions about UAX #29

From: Mark Davis ☕ <mark_at_macchiato.com>
Date: Tue, 5 Jul 2011 08:29:37 -0700

Ah, you're right; I wasn't looking carefully enough at what you wrote.

Yes, an unassigned code point (Cn) is treated as a base character.
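
A quick way to check this is sketched below. It is only an illustration, assuming Python's standard unicodedata module and the definition of Control quoted further down in this thread; the helper names is_control and is_unassigned are made up for the example, and U+0378 is used as a sample code point that is unassigned in current Unicode versions.

```python
import unicodedata

# General categories that the quoted UAX #29 text assigns to Control.
CONTROL_CATEGORIES = {"Zl", "Zp", "Cc", "Cf"}
# Code points the quoted definition explicitly excludes: CR, LF, ZWNJ, ZWJ.
CONTROL_EXCEPTIONS = {0x000D, 0x000A, 0x200C, 0x200D}


def is_control(cp: int) -> bool:
    """Hypothetical helper: True if cp is Control per the quoted definition."""
    if cp in CONTROL_EXCEPTIONS:
        return False
    return unicodedata.category(chr(cp)) in CONTROL_CATEGORIES


def is_unassigned(cp: int) -> bool:
    """Hypothetical helper: True if cp has General_Category = Cn."""
    return unicodedata.category(chr(cp)) == "Cn"


if __name__ == "__main__":
    # Sample code points: an unassigned one, TAB, CR, ZWJ, LINE SEPARATOR.
    for cp in (0x0378, 0x0009, 0x000D, 0x200D, 0x2028):
        print(f"U+{cp:04X} Cn={is_unassigned(cp)} Control={is_control(cp)}")
```

Since Cn is not one of the listed categories, every unassigned code point falls under !Control, so the "!Control ( Grapheme_Extend | Spacing_Mark )*" branch of the quoted expression applies to it.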

Unassigned code points are peculiar beasts, since we don't really know how
they should behave until (and unless) they are assigned. Their treatment by
the Unicode algorithms varies based on a few factors:

   - safety - don't have them behave in a way that causes problems
   - foresight - have them behave like the most likely candidate for future
   assignment
   - simplicity - since they shouldn't occur normally in text, don't spend
   too much time worrying about them.

These are not formalized principles, just my observations on how we've
operated over the years.

Mark
*— Il meglio è l’inimico del bene —*

On Mon, Jul 4, 2011 at 20:17, Karl Williamson <public_at_khwilliamson.com> wrote:

> On 07/03/2011 05:52 PM, Mark Davis ☕ wrote:
>
>>
>>
>> Mark
>> /— Il meglio è l’inimico del bene —/
>>
>>
>> On Sat, Jul 2, 2011 at 14:58, Karl Williamson
>> <public_at_khwilliamson.com> wrote:
>>
>> I have two questions about this.
>>
>> 1) In UAX #44, it says for information about the Grapheme_Base
>> property, to see UAX #29, but that document doesn't mention this
>> property.
>>
>>
>> The documentation on Grapheme_Base in #44 is obsolete. Grapheme_Base has
>> not been used in the specification of grapheme clusters since (I
>> believe) Unicode 3.2.
>>
>>
>> 2) The definition in UAX #29 for both legacy and extended grapheme
>> clusters effectively says that any Gc=Cn code points followed by any
>> number of grapheme_extend code points is a grapheme cluster. Is
>> that what is meant? I notice that Grapheme_Base excludes Cn code
>> points.
>>
>>
>> It doesn't say that. If you had the sequence <Control Extend>, you'd
>> have a break between them according to the following rule:
>> GB4. ( Control | CR | LF ) ÷
>>
>> It would result in two (degenerate) grapheme clusters.
>>
>> We need to fix the documentation to make this clearer. Could you let me
>> know what led you to think that "any Gc=Cn code points followed by any
>> number of grapheme_extend code points is a grapheme cluster" so that we
>> can clarify that?
>>
>
> It says that an extended grapheme cluster matches this:
> ( CRLF
> | Prepend* ( Hangul-syllable | !Control )
> ( Grapheme_Extend | Spacing_Mark)*
> | . )
>
> That tells me that one option for a grapheme cluster is a !Control followed
> by any number of Grapheme_Extends.
>
> Lower down it defines "Control" as
> "General_Category = Line Separator (Zl), or
> General_Category = Paragraph Separator (Zp), or
> General_Category = Control (Cc), or
> General_Category = Format (Cf)
> and not U+000D CARRIAGE RETURN (CR)
> and not U+000A LINE FEED (LF)
> and not U+200C ZERO WIDTH NON-JOINER (ZWNJ)
> and not U+200D ZERO WIDTH JOINER (ZWJ)"
>
> By that definition of Control, all Gc=Cn code points are !Control.
> Therefore a grapheme cluster can be a Cn followed by any number of
> Grapheme_Extends.
>
>
>> Grapheme_Extend, on the other hand, is exactly equivalent to
>> Grapheme_Cluster_Break=Extend.
>>
>>
>
Received on Tue Jul 05 2011 - 10:33:10 CDT
