Re: Diacritical marks: Single character or combined character?

From: Jukka K. Korpela <jkorpela_at_cs.tut.fi>
Date: Fri, 06 Dec 2013 16:55:29 +0200

2013-12-06 0:45, Shriramana Sharma wrote:

> In Unicode the characters with precomposed diacritics are given
> "canonical equivalences" to the corresponding sequences of base
> characters followed by separate diacritics. So Unicode-compliant
> parsing tools should not distinguish between the two.

There is no such requirement.

What the standard says, in clause 3.2, item C6, is: “A process shall not
assume that the interpretations of two canonical-equivalent character
sequences are distinct.”

So a program that sends data to another program should not expect that
the recipient will treat U+0101 and U+0061 U+0304 as distinct. But the
recipient may do so, and (as the standard says in this context) it may
have valid reasons to do so.
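
To make the two sequences concrete, here is a minimal sketch using Python's
standard unicodedata module (the variable names are just for illustration).
It shows that the sequences differ code point by code point but compare
equal once both are brought to the same normalization form:

```python
import unicodedata

precomposed = "\u0101"       # LATIN SMALL LETTER A WITH MACRON
decomposed = "\u0061\u0304"  # LATIN SMALL LETTER A + COMBINING MACRON

# A plain code point comparison treats the two sequences as distinct.
raw_equal = precomposed == decomposed

# After normalizing both sides to the same form, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", precomposed)
             == unicodedata.normalize("NFD", decomposed))

print(raw_equal, nfc_equal, nfd_equal)
```

A process that compares raw code points is distinguishing the two sequences;
one that normalizes first is not. Either can be conformant.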

And the sending program may be based on specific information about the
behavior of the recipient. Even though you should not assume a priori that “ā”
and “ā” are treated as distinct, you may do so if you actually know that
they will.
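
As a rough sketch of that idea: a sender that actually knows which form the
recipient expects could normalize before sending, and otherwise leave the
data alone. The function name prepare_for_recipient and its known_form
parameter are hypothetical, invented for this illustration only:

```python
import unicodedata

def prepare_for_recipient(text, known_form=None):
    """Return text in the form the recipient is known to expect.

    known_form is "NFC" or "NFD" when the sender has actual knowledge
    of the recipient's behavior; None means nothing is known, so the
    text is passed through unchanged.
    """
    if known_form is None:
        return text
    return unicodedata.normalize(known_form, text)

# Recipient known to match only decomposed sequences:
to_send = prepare_for_recipient("\u0101", known_form="NFD")
```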

Yucca
Received on Fri Dec 06 2013 - 08:57:07 CST
