Re: Rendering Raised FULL STOP between Digits from Asmus Freytag on 2013-03-22 (Unicode Mail List Archive)

From: Asmus Freytag <asmusf_at_ix.netcom.com>
Date: Fri, 22 Mar 2013 09:01:51 -0700

On 3/22/2013 4:08 AM, Philippe Verdy wrote:
> 2013/3/22 Asmus Freytag <asmusf_at_ix.netcom.com>:
>> If you need to annotate text with the results of semantic analysis as
>> performed by a human reader, then you either need XML, or some other format
>> that can express that particular intent.
> Absolutely NO. If this encodes semantics, this is part of plain text,

I think we are on a different page here. In some ways the Unicode term
"semantics" is very misleading in this context. What Unicode means by
this fancy term is the character's identity - not its use.

If you use a colon to mark abbreviation (as in Swedish) you are using a
colon - the use may be very different from how a colon is used
elsewhere, but it does not create a new character.

Unicode does not encode the semantics of a sentence or word, but
provides a string of characters of known identity that lets a human
reader determine the semantics of that sentence or word as unambiguously
as if that sentence had been reproduced by analog means - that's, in a
nutshell, what Unicode attempts to do.

> and not part of an upper layer protocol. Notably these characters
> should be used to alter the default (ambiguous) character properties of
> the characters they modify, and notably to give them the semantics
> needed for existing Unicode algorithms (general categories:
> punctuation, diacritic; word-breaking properties...)

Character properties define the *default* behavior of a given
character. There are many examples, especially in the context of
punctuation where a character can have different uses. Each use may need
a different treatment by readers (or algorithms).
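
As a minimal sketch of what "default" means here, assuming Python's
standard unicodedata module (the helper name describe and the two sample
strings are illustrative assumptions, not taken from this thread): the
colon reports the same identity and general category whether it is used
as ordinary punctuation or as a Swedish abbreviation mark.

```python
import unicodedata

def describe(ch):
    """Return (code point, character name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# Illustrative samples (assumptions): a colon as ordinary punctuation
# and a colon marking an abbreviation, as in Swedish.
samples = ["Note: see below", "S:t Eriksgatan"]

for text in samples:
    colon = text[text.index(":")]
    # The reported properties depend only on the character, not on its use.
    print(text, "->", describe(colon))
```

Telling the two uses apart is left to the reader, or to an algorithm
that looks at context; the character database only supplies the default.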

To handle some behaviors, you may need complex processing (natural
language processing) that mimics what a human reader can do.

There are a few exceptions where characters are disunified based on
properties - the most principled of these involve properties that can't
be modified, such as the bidi property. There are about a dozen
characters that look entirely alike (by design and derivation) yet have
been disunified based on bidi properties - because bidi properties
cannot be overridden.
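
A bidi class can be looked up the same way. The sketch below again
assumes Python's unicodedata module; the pair of digits is my own choice
of two similar characters with different bidi classes and is not
necessarily among the characters meant above.

```python
import unicodedata

def bidi_class(ch):
    """Return the Bidi_Class short name (e.g. 'EN', 'AN', 'CS') of a character."""
    return unicodedata.bidirectional(ch)

# Illustrative pair (an assumption, not named in this thread):
# two zero digits that differ in bidi class.
ARABIC_INDIC_ZERO = "\u0660"           # ARABIC-INDIC DIGIT ZERO
EXTENDED_ARABIC_INDIC_ZERO = "\u06F0"  # EXTENDED ARABIC-INDIC DIGIT ZERO

for ch in (ARABIC_INDIC_ZERO, EXTENDED_ARABIC_INDIC_ZERO):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), bidi_class(ch))
```

Plain text offers no way to assign a different bidi class to a given
code point, so a difference of this kind has to be carried by the
character code itself.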

There are a few other cases, usually where a character can be both
letter and punctuation, where such disunifications were made based on
overridable properties. Here the reason was that this distinction has
such a wide reach (and had to be applied by many basic algorithms) that
breaking the principle of single character identity can be justified.
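
A rough illustration of that reach, assuming Python's unicodedata and re
modules: the look-alike pairs below are my own examples, the helper
names are hypothetical, and the \w pattern is only a crude stand-in for
real word-breaking rules.

```python
import re
import unicodedata

# Look-alike pairs as (letter, punctuation); chosen for illustration only.
PAIRS = [
    ("\u02BC", "\u2019"),  # MODIFIER LETTER APOSTROPHE vs RIGHT SINGLE QUOTATION MARK
    ("\u01C3", "\u0021"),  # LATIN LETTER RETROFLEX CLICK vs EXCLAMATION MARK
]

def categories(pair):
    """Return the general categories of both members of a pair."""
    return tuple(unicodedata.category(ch) for ch in pair)

def naive_words(text):
    """Split text into runs of word characters (not UAX #29 word breaking)."""
    return re.findall(r"\w+", text)

for pair in PAIRS:
    print([f"U+{ord(ch):04X}" for ch in pair], categories(pair))

# The letter keeps the word in one piece; the punctuation mark splits it.
print(naive_words("k\u02BCa"))
print(naive_words("k\u2019a"))
```

Even a tokenizer this simple treats the two members of a pair
differently, which is the kind of basic algorithm the distinction has
to serve.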

If a problem is sufficiently severe, then you'd possibly have
justification to disunify. If not, then the answer would be outside the
scope of character encoding.

>
> adding new variants of existing characters like what was done
> specifically for maths is not a stable long term solution; solutions
> similar to variant selectors however are much more meaningful, and
> will allow for example to make the distinction between a MIDDLE DOT
> punctuation and an ANO TELEIA, and will also allow them to be rendered
> differently (even if there's no requirement to do so).
>
> This is absolutely not "pseudo-coding".
>
"Pseudo coding" refers to making distinctions between characters not on
their basic encoding, but by means of "attributes" such as the selectors
you are suggesting.
Received on Fri Mar 22 2013 - 11:03:15 CDT