Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal) from Philippe Verdy on 2011-07-17 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sun, 17 Jul 2011 21:19:04 +0200

2011/7/17 Asmus Freytag <asmusf_at_ix.netcom.com>:
> On 7/17/2011 2:35 AM, Michael Everson wrote:
>>
>> ... invisible and stateful control characters are more expensive than
>> ordinary graphic symbols.
>
> In this case, the expense is so much higher as to rule out such an idea from
> the start.
>
> A./
>
> PS: this doesn't mean that adding graphic symbols is the foregone thing to
> do, only that, if evidence points to the need to address this issue in
> character encoding, then, using graphic symbols is the better way to go
> about it.

Another alternative: instead of encoding separate symbols for each
control, we could as well encode symbols for each character visible in
those symbols.

E.g. to represent the glyph for the RLO control, we could encode three
characters, one for each of R, L, and O, as DOTTED SYMBOL FOR LATIN
CAPITAL LETTER R, DOTTED SYMBOL FOR LATIN CAPITAL LETTER L, DOTTED
SYMBOL FOR LATIN CAPITAL LETTER O. Each of these three symbols would
have as its representative glyph the base letter from which it is
derived, within a dotted rectangle.

Then each of them would contextually adopt one of four glyph forms:
the full rectangle, the rectangle with the left or right side
removed, or the rectangle with both sides removed. The selection
would be performed contextually, depending on whether another dotted
symbol immediately precedes or follows.

If this is still too complex, because fonts would have to look up
lots of pairs, we could instead use the normal Latin letters or
symbols, each one modified by an enclosing diacritic encoded after
it (with combining class zero, so that it will not reorder during
normalization, and with general category "Me" for enclosing). In this
case we just need to encode four diacritics:

U+xxx0: ENCLOSING DOTTED SQUARE JOINED ON BOTH SIDES (short alias "EDSB" below)
U+xxx1: ENCLOSING DOTTED SQUARE JOINED ON START SIDE ONLY (alias "EDSS")
U+xxx2: ENCLOSING DOTTED SQUARE JOINED ON END SIDE ONLY (alias "EDSE")
U+xxx3: ENCLOSING DOTTED SQUARE DISJOINED (alias "EDSD")

Then to represent the symbol for RLO in a dotted square, we would use
<R, EDSS, L, EDSB, O, EDSE>.
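
A minimal sketch of how such a sequence could be built, assuming
Private Use Area code points U+E000..U+E003 as placeholders for the
unassigned U+xxx0..U+xxx3 (they are not real assignments), and
assuming the label consists of simple single-code-point letters. The
mapping of first/middle/last letter follows the RLO example above;
the function name is made up.

```python
# Placeholder code points only: the four diacritics are a proposal,
# so Private Use Area values stand in for U+xxx0..U+xxx3.
EDSB = "\uE000"  # joined on both sides
EDSS = "\uE001"  # joined on start side only
EDSE = "\uE002"  # joined on end side only
EDSD = "\uE003"  # disjoined


def enclose(label):
    """Append one enclosing diacritic after each letter of `label`."""
    n = len(label)
    if n == 0:
        return ""
    if n == 1:
        return label + EDSD  # a lone letter gets the complete box
    parts = []
    for i, ch in enumerate(label):
        if i == 0:
            mark = EDSS      # first letter, as in <R, EDSS, ...>
        elif i == n - 1:
            mark = EDSE      # last letter, as in <..., O, EDSE>
        else:
            mark = EDSB      # letters in between
        parts.append(ch + mark)
    return "".join(parts)


print([f"U+{ord(c):04X}" for c in enclose("RLO")])
```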

The only problem with this representation using normal characters is
that fonts (or text renderers) may have to reduce the size of the
glyphs for the characters within these enclosing boxes for best
display (but this should not be a requirement, there's no fundamental
difference, the only change being the overall width/height of the
fully composed "symbol").

No complexity, no control used. Only the visible symbol is
represented, not the control that this string represents (there's not
even a requirement that such a string represent an actual Unicode
character, it could be used for various symbols, or in texts that need
to encode such enclosing). It can enclose any kind of character of any
script, including diacritics or digits, or non-breaking spaces.

And by extension we could as well add similar diacritics for
enclosing dotted circles/ovals, or for enclosing squares/rectangles,
of arbitrary lengths. Note that we already have combining characters
for enclosing boxes and circles, so this is not really a new concept
in Unicode.
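
For reference, the properties of two of the existing enclosing marks
can be checked with Python's standard unicodedata module. This is only
an illustration on already encoded characters (U+20DD and U+20DE); the
helper name is made up, and nothing here says anything about the
proposed diacritics themselves.

```python
import unicodedata


def describe(cp):
    """Return (name, general category, canonical combining class)."""
    ch = chr(cp)
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )


for cp in (0x20DD, 0x20DE):
    print(f"U+{cp:04X}", *describe(cp))

# A base letter followed by an enclosing mark is left untouched by
# normalization: there is nothing to reorder or compose.
sample = "R\u20DE"
print(unicodedata.normalize("NFC", sample) == sample)
print(unicodedata.normalize("NFD", sample) == sample)
```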

It's true that such a representation using explicitly encoded
diacritics is an alternative to text decorations used in rich text
formats. The encoding would be expensive enough to discourage its use
for enclosing arbitrarily long texts (which will certainly benefit
more from an external text decoration of a "span" of text in a
higher-level protocol, such as CSS using "border:" properties).

One caveat is that such a sequence would not be collated as a single
grapheme cluster (is it a problem? this is already the case when a
text cites the abbreviation "RLO" using plain Latin letters,
possibly surrounded by regular punctuation, symbols or spaces), and
it could run into the words appearing directly on each side (only a
problem for word breakers, but if a SPACE character is not separating
the "symbol" from the surrounding text, we can still use a ZERO WIDTH
SPACE to separate them).
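
To make the caveat concrete, here is a small sketch. It uses a
deliberately simplified segmentation (each base character plus the
marks that follow it), which only approximates the real grapheme
cluster rules, and it uses the existing U+20DE COMBINING ENCLOSING
SQUARE as a stand-in for the proposed diacritics. Both function names
are made up.

```python
import unicodedata

ZWSP = "\u200B"  # ZERO WIDTH SPACE


def simple_clusters(text):
    """Group each base character with the combining marks after it.

    Simplified approximation only, not the full segmentation rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def isolate(symbol, before="", after=""):
    """Insert ZWSP where no whitespace separates the symbol from text."""
    left = "" if (not before or before[-1].isspace()) else ZWSP
    right = "" if (not after or after[0].isspace()) else ZWSP
    return before + left + symbol + right + after


boxed = "R\u20DEL\u20DEO\u20DE"
print(len(simple_clusters(boxed)))  # one cluster per enclosed letter
print(ascii(isolate(boxed, before="the", after=" control")))
```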

I also see no problem if those sequences are not recognized as
"symbols", but as words. It would even benefit spell checkers, which
could easily detect that the enclosed letters are in fact treated as
abbreviations, where each grapheme is decorated by these diacritics.

Note: in a previous message I already mentioned another alternative,
using start and end punctuation marks (i.e. general category "Ps" and
"Pe") that would be normal base characters (similar to parentheses,
brackets and braces), but the difficulty is to have them connect
graphically on top of sequences of separate grapheme clusters.

-- Philippe.
Received on Sun Jul 17 2011 - 14:21:23 CDT
