Re: Connector Punctuation and Overlines from Ken Whistler on 2012-03-07 (Unicode Mail List Archive)

From: Ken Whistler <kenw_at_sybase.com>
Date: Wed, 07 Mar 2012 11:39:42 -0800

On 3/6/2012 8:27 PM, fantasai wrote:
> Unicode has a Pc category into which it assigns various low lines:
>
> _ U+005F LOW LINE
> ‿ U+203F UNDERTIE
> ⁀ U+2040 CHARACTER TIE
> ⁔ U+2054 INVERTED UNDERTIE

Those 4 are the actual connectors. The concept arose because of the
peculiar behavior of U+005F LOW LINE, which although classed as
"punctuation", in majority usage doesn't actually serve to delimit things,
but rather is a way of tying them together, particularly for identifier
syntaxes. For decades now, programmers have been using it as a replacement
for SPACE which allows for visual separation of "words" without the
segmentation effects.

The various TIEs are traditional editing marks which have a comparable
effect. Although they don't occur in regular orthographies and are not
widely used in any syntax, if they *do* occur in digital text, the default
behavior you would want for them would be to keep elements together,
rather than separate them.

> ︳ U+FE33 PRESENTATION FORM FOR VERTICAL LOW LINE
> ︴ U+FE34 PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
> ﹍ U+FE4D DASHED LOW LINE
> ﹎ U+FE4E CENTRELINE LOW LINE
> ﹏ U+FE4F WAVY LOW LINE
> ＿ U+FF3F FULLWIDTH LOW LINE
>
>

Those 6 are completely different. The first 5 are compatibility dreck
coming out of CNS, and their original intent (most likely) was to
represent various styles of underlining of Chinese text. They cannot be
meaningfully used for that now -- you would do that instead with text
styles -- but they are encoded for roundtrip conversion to CNS. U+FF3F
is just a fullwidth variant from Shift-JIS (etc.)

The reason they are gc=Pc is entirely a normalization consistency issue,
because they all have compatibility decompositions to U+005F LOW LINE.
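The decompositions can be checked against whatever version of the
Unicode Character Database a given runtime ships. A minimal sketch in
Python, using the standard unicodedata module; the names LOW_LINES and
describe are just illustrative:

```python
import unicodedata

# U+005F plus the six compatibility characters from the quoted list.
LOW_LINES = ["\u005F", "\uFE33", "\uFE34", "\uFE4D", "\uFE4E", "\uFE4F", "\uFF3F"]

def describe(ch):
    """Return (code point, General_Category, decomposition field) for ch."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.decomposition(ch),
    )

if __name__ == "__main__":
    for ch in LOW_LINES:
        code, gc, decomp = describe(ch)
        # NFKC applies the compatibility decomposition, so each of the
        # six compatibility characters should fold to plain U+005F.
        folded = unicodedata.normalize("NFKC", ch)
        print(code, gc, decomp or "(none)", "->", f"U+{ord(folded):04X}")
```

If any of the six were assigned a different General_Category, NFKC
normalization would change the category of the text it folds, which is
the inconsistency the gc=Pc assignment avoids.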

> However, the overlines that are almost exactly the same thing, are
> categorized as Po:
>
> ‾ U+203E OVERLINE

The overline isn't typically used to tie anything together. This is
essentially just a spacing clone of the combining overline.
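The "spacing clone" relationship is visible in the character data
itself. A small sketch, again assuming Python's unicodedata module (the
constant names are only for illustration):

```python
import unicodedata

OVERLINE = "\u203E"            # U+203E OVERLINE
COMBINING_OVERLINE = "\u0305"  # U+0305 COMBINING OVERLINE

# Compatibility decomposition of the spacing overline: it should come
# out as SPACE followed by the combining overline.
decomposed = unicodedata.normalize("NFKD", OVERLINE)

if __name__ == "__main__":
    print(unicodedata.category(OVERLINE))  # General_Category of U+203E
    print([f"U+{ord(c):04X}" for c in decomposed])
```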

> ﹉ U+FE49 DASHED OVERLINE
> ﹊ U+FE4A CENTRELINE OVERLINE
> ﹋ U+FE4B WAVY OVERLINE
> ﹌ U+FE4C DOUBLE WAVY OVERLINE

And those 4 are more CNS compatibility dreck, again representing badly
encoded characters for what should actually be done with text styles.

>
> Is this a bug or a feature? :) Shouldn't they be Pc?

It is a feature. And no, they should not be gc=Pc.

The main algorithmic consequences of gc=Pc are that U+005F (and the kin it
drags along) are Word_Break=ExtendNumLet, which keeps them from
defining default word boundaries, and gc=Pc is included in the
derivation of ID_Continue (and XID_Continue), which keeps them in
identifiers.
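The identifier side of this is easy to observe in any language whose
identifier syntax follows XID_Start/XID_Continue. Python is one such
language, so a rough sketch using str.isidentifier() is shown here; the
helper name fits_in_identifier is made up for illustration, and this
says nothing about Word_Break, which the Python standard library does
not expose:

```python
# The four "actual connectors" (gc=Pc) and the overlines (gc=Po)
# from the quoted lists.
CONNECTORS = "\u005F\u203F\u2040\u2054"
OVERLINES = "\u203E\uFE49\uFE4A\uFE4B\uFE4C"

def fits_in_identifier(ch):
    """True if ch can sit between two letters inside a Python identifier."""
    return ("a" + ch + "b").isidentifier()

if __name__ == "__main__":
    for ch in CONNECTORS + OVERLINES:
        print(f"U+{ord(ch):04X}", fits_in_identifier(ch))
```

The connectors are accepted in the continue position because gc=Pc
feeds into ID_Continue; the overlines, being gc=Po, are not.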

I don't know of any particular reason why anyone would want to keep the
spacing overline either inside default word segments or inside identifiers.

--Ken
Received on Wed Mar 07 2012 - 13:44:39 CST
