Re: New Character Property for Prepended Concatenation Marks

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Thu, 26 Nov 2015 12:08:43 +0100

The related definition for extended grapheme clusters says:

( CRLF
| Prepend* ( RI-sequence | Hangul-Syllable | !Control )
           ( Grapheme_Extend | SpacingMark )*
| . )

However, I do not understand why it may include only one Hangul-Syllable
when applying prepended concatenation marks. And if the definition excludes
whitespace, nothing prevents it from extending to arbitrary sequences of
letters, digits, symbols and punctuation (this could span very long sequences
of sinograms, or of letters from other scripts that do not use whitespace as
a word separator). Even in the Latin script it would extend to the
punctuation signs that may follow any word, or to an entire mathematical
formula such as "1+2*3" but not "sin x"...
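
To illustrate the concern, here is a minimal Python sketch. It assumes a
hypothetical rule in which the text following a prepended mark belongs to
the mark until the first whitespace or control character, and contrasts it
with the digits-only enumeration suggested in the quoted message below. The
names naive_extent, enumerated_extent and ARABIC_DIGITS are invented for
this example and are not part of any proposal.

```python
import unicodedata


def naive_extent(text: str) -> str:
    """Extent under an assumed 'stop at whitespace or control' rule."""
    covered = []
    for ch in text:
        # Stop at any whitespace or at a control character (category Cc).
        if ch.isspace() or unicodedata.category(ch) == "Cc":
            break
        covered.append(ch)
    return "".join(covered)


# Assumed enumeration: Arabic-Indic digits (U+0660..U+0669) and
# extended Arabic-Indic digits (U+06F0..U+06F9).
ARABIC_DIGITS = {chr(c) for c in range(0x0660, 0x066A)} | {
    chr(c) for c in range(0x06F0, 0x06FA)
}


def enumerated_extent(text: str, allowed=ARABIC_DIGITS) -> str:
    """Extent when only characters from an explicit enumeration qualify."""
    covered = []
    for ch in text:
        if ch not in allowed:
            break
        covered.append(ch)
    return "".join(covered)
```

Under these assumptions, naive_extent("1+2*3") covers the whole formula
while naive_extent("sin x") covers only "sin"; enumerated_extent stops at
the first character outside the enumeration, so a separator already ends
the run.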

2015-11-26 11:41 GMT+01:00 Philippe Verdy <verdy_p_at_wanadoo.fr>:

> The root sign is much more complex than just prepending specific sequences
> of characters (in a limited set): when it embeds some "text", it can do so
> recursively and unless you use additional parentheses for the linear
> presentation, it highly depends on the 2D layout of its operand
> (additionally it could be prefixed itself by a superscripted radix value).
> Leave it alone: the 2D layout (even in the linear presentation using
> parentheses where needed) will be mapped using an additional mathematical
> presentation layer and notation.
> For the basic plain-text, the root sign will just stay alone without using
> any complex layout, and its operand will simply follow it (using
> parentheses where needed) without specific rendering.
>
> ----
>
> However the proposal for these prepended concatenation marks does not give
> any hint about how to compute the extent of the following clusters
> above/over/below/around which they will apply (do they extend over only
> letters/digits, but not whitespaces or punctuation signs including
> abbreviation marks?).
>
> For me this kind of visual interaction should be more explicitly delimited
> using special marks (working like invisible parentheses): the absence of
> these special marks immediately after the prepended concatenation mark
> should mean that they will not extend after the next (non-whitespace)
> cluster.
>
>
> So:
>
> - <ARABIC NUMBER SIGN, SPACE, ARABIC DIGIT ONE> will display the isolated
> number sign WITHOUT extending to the following space and digit
>
> - <ARABIC NUMBER SIGN, ARABIC DIGIT ONE, ARABIC DIGIT TWO> will apply the
> number sign ONLY to the first digit
>
> - <ARABIC NUMBER SIGN, START OF SEQUENCE, ARABIC DIGIT ONE, ARABIC DIGIT
> TWO, END OF SEQUENCE> will apply the number sign to the two digits
>
> - <ARABIC NUMBER SIGN, START OF SEQUENCE, ARABIC DIGIT ONE, FULL STOP, ARABIC
> DIGIT TWO, END OF SEQUENCE> will apply the number sign to the two digits
> and the separating full stop
>
> - <ARABIC NUMBER SIGN, START OF SEQUENCE, ARABIC DIGIT ONE, SPACE, ARABIC
> DIGIT TWO, END OF SEQUENCE> will apply the number sign to the two digits
> and the separating space
>
> - <ARABIC NUMBER SIGN, START OF SEQUENCE, ARABIC DIGIT ONE, NEWLINE, ARABIC
> DIGIT TWO, END OF SEQUENCE> will apply the number sign to the first digit
> only, before the newline control; the second digit will appear on the next
> line, outside the number sign's complex cluster, and the second special
> control (END OF SEQUENCE) will be ignored (or would display with a
> "visible control glyph").
>
> Without the <START OF SEQUENCE> and <END OF SEQUENCE> special controls,
> it will be necessary anyway to define specific enumerations of characters
> that can be part of the sequence on which the prepended mark will apply.
>
> Another complication: when such prepended sequences are recognized, there
> are specific tunings to apply in line-breaking algorithms.
>
> Word breaking algorithms may not need specific changes if the enumerations
> of characters that can be part of the prepended sequence cannot contain any
> word-breaking character. That's why I suggested that, by default, such
> enumerations should include only letters and digits but not whitespace (and
> probably not punctuation signs such as the dot), plus their additional
> combining marks.
>
> - For Arabic U+0600, U+0601 and U+0605 (TUS-9.2, page 374), the
> enumeration is supposed to contain only Arabic-Indic or extended
> Arabic-Indic digits, but I wonder whether it should also include number
> separators, or even Arabic-European digits.
> - Same remark for the Kaithi number sign U+110BD.
> - For Syriac U+070F (TUS-9.3, pages 390-391), the enumeration is not so
> obvious (all Syriac "letter-numbers"?)
>
> There are also similar characters in other scripts not listed: one example
> is the Cyrillic hundred-thousands/millions marks U+0488..U+0489, which
> may enclose more than one digit (currently encoded as combining marks
> applicable to only one digit?); another is the Egyptian Hieroglyph
> honorific "Cartouche", which encloses the name of a king; other examples
> are possible as well in other Asian scripts for honorific marks.
>
> The system using explicitly delimited sequences would work as well with
> the Latin script for some honorific "decorators" which are not just
> ligatures, e.g. for the name of God or Jesus Christ (which may also be
> abbreviated), including for Quranic transcriptions.
>
> -- Philippe.
>
>
> 2015-11-26 9:10 GMT+01:00 "Jörg Knappen" <jknappen_at_web.de>:
>
>> I wonder how this concept relates to mathematical notation, especially
>> the root sign.
>>
>> --Jörg Knappen
>>
>> *Sent:* Wednesday, 25 November 2015 at 23:34
>> *From:* announcements_at_unicode.org
>> *To:* announcements_at_unicode.org
>> *Subject:* New Character Property for Prepended Concatenation Marks
>>
>> The Unicode Technical Committee is seeking feedback on a proposal to
>> define a new character property for the class of *prepended
>> concatenation marks*, also referred to as *prefixed format control
>> characters* or, more generically, as subtending marks. Characters in
>> that class include U+0600 ARABIC NUMBER SIGN and U+06DD ARABIC END OF AYAH.
>> The new property, named Prepended_Concatenation_Mark and targeted for
>> Unicode 9.0, would provide a mechanism to handle subtending marks
>> collectively via properties rather than by hardcoded enumeration. A
>> detailed description of the issue and how to provide feedback are given in Public
>> Review Issue #310 <http://www.unicode.org/review/pri310/>.
>>
>> http://blog.unicode.org/2015/11/new-character-property-for-prepended.html
>>
>>
>
>
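
The delimited-sequence examples in the quoted message can be sketched in
Python as follows. This is only an illustration under stated assumptions:
START OF SEQUENCE and END OF SEQUENCE are not encoded characters, so the
private-use code points U+E000 and U+E001 stand in for them; the
"ARABIC DIGIT ONE/TWO" of the examples are taken to be U+0661 and U+0662;
and a "cluster" is approximated as one base character plus any following
combining marks. The name mark_extent is hypothetical.

```python
import unicodedata

NUMBER_SIGN = "\u0600"  # ARABIC NUMBER SIGN
SOS = "\ue000"          # placeholder for the hypothetical START OF SEQUENCE
EOS = "\ue001"          # placeholder for the hypothetical END OF SEQUENCE
ONE = "\u0661"          # ARABIC-INDIC DIGIT ONE (assumed for "ARABIC DIGIT ONE")
TWO = "\u0662"          # ARABIC-INDIC DIGIT TWO (assumed for "ARABIC DIGIT TWO")


def mark_extent(text: str) -> str:
    """Return the characters that a leading number sign would apply to."""
    if not text.startswith(NUMBER_SIGN):
        raise ValueError("text must start with the prepended mark")
    rest = text[1:]

    if rest.startswith(SOS):
        # Delimited form: everything up to END OF SEQUENCE, but a newline
        # terminates the extent early.
        covered = []
        for ch in rest[1:]:
            if ch == EOS or ch in "\r\n":
                break
            covered.append(ch)
        return "".join(covered)

    # Undelimited form: at most the next non-whitespace cluster.
    if not rest or rest[0].isspace():
        return ""
    end = 1
    # Approximation: attach following combining marks (categories Mn/Mc/Me).
    while end < len(rest) and unicodedata.category(rest[end]).startswith("M"):
        end += 1
    return rest[:end]
```

With these placeholders, the six examples give, in order: nothing, the
first digit only, both digits, both digits with the full stop, both digits
with the space, and the first digit only.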

Received on Thu Nov 26 2015 - 05:10:03 CST
