Re: Can the combining diacritical marks combine with any base character?

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 11 Feb 2013 02:34:06 +0100

The same would be true if you wrote a combining character after an
apostrophe or a double quote within an XML or HTML attribute: NFC
will not combine it with these syntactic delimiters.
However, this would not be true if you wrote certain combining
characters after an equals sign (for example U+0338, which composes
with "=" into U+2260). For HTML and XML, the solution is to write these
combining characters (which are part of an attribute value) between
quotation marks (mandatory in XML).
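
To make the difference concrete, here is a minimal Python sketch using
the standard unicodedata module. The sample strings and the nfc() helper
are my own illustration, not something taken from the thread.

```python
import unicodedata

def nfc(s):
    """Shorthand for NFC normalization (helper name is illustrative)."""
    return unicodedata.normalize("NFC", s)

# Combining tilde right after a quote delimiter: nothing to compose with.
quoted = 'title="\u0303a"'
# Combining long solidus overlay directly after an unquoted "=".
unquoted = "title=\u0338a"

print(nfc(quoted) == quoted)       # is the quoted form left untouched?
print("=" in nfc(unquoted))        # does the "=" delimiter survive?
print(nfc(">\u0338") == "\u226f")  # ">" followed by U+0338 composes too
```

The quoted form keeps its delimiter, while the unquoted "=" is absorbed
into U+2260 (NOT EQUAL TO) and is no longer there for a parser to find.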

The problem, however, lies within text editors. For that case it is
probably better to encode a leading combining character as a numeric
character reference. This is only needed when editing the XML/HTML file
manually; an HTML or XML generator whose output is meant to be parsed
by a machine and not re-edited by a human may safely ignore this.
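
A possible way to do that escaping, as a small Python sketch. The
function name escape_leading_combining is hypothetical, and treating
"general category starts with M" as the test for a combining character
is my assumption; a real tool might prefer a different criterion.

```python
import unicodedata

def escape_leading_combining(value):
    """Write a leading combining mark as a numeric character reference.

    Assumption: a character counts as combining when its general
    category is Mn, Mc or Me.
    """
    if value and unicodedata.category(value[0]).startswith("M"):
        return "&#x{:04X};".format(ord(value[0])) + value[1:]
    return value

print(escape_leading_combining("\u0338abc"))
print(escape_leading_combining("abc"))
```

Only the first character is touched, since that is the one that can end
up next to a delimiter in the source file.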

However, this has a consequence: you cannot blindly normalize an
HTML/XML source file as if it were "flat" plain text. Normalization
of these files should be performed on the parsed DOM: on individual
text nodes, individual attribute values, individual element or
attribute names, or individual comments (and it should be avoided
in processing instructions). Similar considerations apply to source
files in programming languages (such as JavaScript, PHP or C++
source files containing string literals), which should not be
normalized without knowledge of the syntax of these languages.
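
A rough sketch of the two approaches in Python, using the standard
xml.etree.ElementTree parser as a stand-in for a DOM. The sample
document and the names normalize_tree and is_well_formed are
hypothetical. Note that ElementTree drops comments and processing
instructions by default, and that this sketch leaves element and
attribute names alone.

```python
import unicodedata
import xml.etree.ElementTree as ET

def normalize_tree(root, form="NFC"):
    """Normalize text nodes and attribute values one by one, in place."""
    for el in root.iter():
        if el.text:
            el.text = unicodedata.normalize(form, el.text)
        if el.tail:
            el.tail = unicodedata.normalize(form, el.tail)
        for name, value in list(el.attrib.items()):
            el.set(name, unicodedata.normalize(form, value))
    return root

def is_well_formed(source):
    """Return True if the string parses as XML."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False

# The text content starts with U+0338, right after the ">" of the tag.
SRC = '<a title="e\u0301">\u0338b</a>'

# Flat normalization: ">" + U+0338 composes into U+226F, the tag is lost.
flat = unicodedata.normalize("NFC", SRC)
print(is_well_formed(SRC), is_well_formed(flat))

# Normalization on the parsed tree: each value is handled separately.
root = normalize_tree(ET.fromstring(SRC))
print(root.get("title") == "\u00e9", root.text == "\u0338b")
```

Writing the tree back out still places U+0338 right after ">", so a
serializer would want to emit that leading mark as a numeric character
reference if the file may later be normalized as flat text.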

Syntactic problems created by normalization may be more serious in some
file formats, such as data files where spaces are used as field
separators: these are not really flat plain-text files either.
Normalization is only safe within each individual field, i.e. on the
parsed elements of the document.
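
One way this can show up, sketched in Python. This is an assumed
illustration of mine: it uses the compatibility form NFKC rather than
NFC, because NFKC maps U+00A0 NO-BREAK SPACE to a plain space and so
can create a new separator. The sample record and the function names
are hypothetical.

```python
import unicodedata

def fields_flat(line, form="NFKC"):
    """Normalize the whole line first, then split on U+0020."""
    return unicodedata.normalize(form, line).split(" ")

def fields_parsed(line, form="NFKC"):
    """Split on U+0020 first, then normalize each field on its own."""
    return [unicodedata.normalize(form, f) for f in line.split(" ")]

# Two fields; the first one contains a no-break space.
line = "New\u00a0York 42"

print(len(fields_flat(line)))    # field count after flat normalization
print(len(fields_parsed(line)))  # field count after per-field normalization
```

The per-field version keeps the record at two fields, although the first
field now contains a plain space and would need quoting or escaping if
it were written back in the same format.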

In addition, there may be data validity constraints in those languages,
even if these texts are still valid as far as the Unicode standard
itself is concerned: such extra constraints are out of scope of the
standard. To help define validity rules for programming languages,
however, the Unicode standard suggests rules that allow them to define
identifiers which can be "safely" internationalized, by adding
sufficient constraints that normalization should not be a problem
for parsing those languages. Unicode does not define how these
identifiers will collate and match. Some programming languages may
ignore some case differences, but most of them will not treat
identifiers that are canonically equivalent in Unicode as equivalent
in their parsers, so these languages will still define their own rules
for valid and unique identifiers.
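
The matching question can be illustrated with a short Python sketch.
Python's str.isidentifier() is used here only as a convenient stand-in
for an identifier syntax check, and the symbol tables are plain
dictionaries of my own invention, not any particular compiler's
behaviour.

```python
import unicodedata

precomposed = "caf\u00e9"   # e with acute as a single code point
decomposed = "cafe\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

# Both spellings pass the identifier syntax check.
print(precomposed.isidentifier(), decomposed.isidentifier())

# A symbol table keyed on raw code points sees two distinct names.
raw_table = {precomposed: 1, decomposed: 2}
print(len(raw_table))

def canonical_key(name, form="NFC"):
    """Key under which canonically equivalent spellings coincide."""
    return unicodedata.normalize(form, name)

# A table keyed on the normalized form groups both spellings together.
folded_table = {}
for name in (precomposed, decomposed):
    folded_table.setdefault(canonical_key(name), []).append(name)
print(len(folded_table))
```

Whether a language uses the first kind of table or the second is exactly
the sort of rule it has to define for itself.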

2013/2/11 David Starner <prosfilaes_at_gmail.com>:
> On Sun, Feb 10, 2013 at 3:46 PM, Costello, Roger L. <costello_at_mitre.org> wrote:
>> Hi Folks,
>>
>> Can the combining diacritical marks combine with any base character?
>
> Yes.
>
>> If yes, wouldn't normalizing this:
>>
>> <comment>(U+0303)
>>
>> to NFC result in converting the XML start tag into non-well-formed XML? (It is not well-formed because there is no longer a '>' character after the tag name; rather, there is a '>' character with a tilde on top.)
>
> Normalizing it to NFC would change nothing, since there's no
> precomposed '>' + diacritic characters.
Received on Sun Feb 10 2013 - 19:38:04 CST
