Re: Script Names

From: Mark Davis (markdavis@ispchannel.com)
Date: Mon May 22 2000 - 10:05:13 EDT


I discussed some of these topics a bit in the previous message to M. Leca.

You bring up a good point about normalization. The goal in composing this list was that no characters change script under NFC or NFD (if the script of combining marks is ignored). That being said, I have not yet mechanically verified that.
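A mechanical check of that goal could look roughly like the sketch below. It is only an illustration: the SCRIPT table is a hypothetical, hand-filled stand-in for the real data file (Python's standard library does not expose a Script property), and the helper names are invented here. Combining marks are skipped by general category, which is one possible reading of "ignoring the script of combining marks".

```python
import unicodedata

# Hypothetical, hand-filled stand-in for the Script property; a real check
# would load the data file from the technical report instead.
SCRIPT = {
    0x0041: "Latin",   # LATIN CAPITAL LETTER A
    0x00C0: "Latin",   # LATIN CAPITAL LETTER A WITH GRAVE
    0x0300: "Common",  # COMBINING GRAVE ACCENT
}

def scripts_ignoring_marks(text):
    """Return the set of scripts used in text, skipping combining marks."""
    return {
        SCRIPT[ord(ch)]
        for ch in text
        if not unicodedata.category(ch).startswith("M")
    }

def script_stable(ch):
    """True if ch keeps the same script set under both NFC and NFD."""
    original = scripts_ignoring_marks(ch)
    return all(
        scripts_ignoring_marks(unicodedata.normalize(form, ch)) == original
        for form in ("NFC", "NFD")
    )
```

Running script_stable over every code point in a complete table, rather than the three entries assumed here, would be the actual verification.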

It was not a goal to do that for NFKC or NFKD, since that would widen it out to symbols. Although there are advantages to only applying it to letters, if there are good reasons for changing this policy we can do it -- the goal is to have it be an informative property that is as useful as possible over a range of applications (and noting circumstances where it needs to be customized for specific requirements).
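To illustrate why the compatibility forms pull symbols in, here is a small sketch using Python's standard unicodedata module. The choice of U+2122 TRADE MARK SIGN is mine, not taken from the data file: it is a symbol that the canonical forms leave alone, but NFKD rewrites it as two Latin capital letters.

```python
import unicodedata

def categories(text):
    """Return the general category of each character in text."""
    return [unicodedata.category(ch) for ch in text]

tm = "\u2122"  # TRADE MARK SIGN, chosen here purely as an illustration

# Canonical decomposition leaves the symbol untouched.
canonical = unicodedata.normalize("NFD", tm)

# Compatibility decomposition replaces it with the letters "T" and "M".
compat = unicodedata.normalize("NFKD", tm)

print(categories(canonical), categories(compat))
```

Keeping scripts stable under NFKC or NFKD would therefore mean assigning a letter script to symbols like this one, which is the widening mentioned above.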

Mark

Marco Cimarosti wrote:

> Mark Davis wrote:
> >>There is a new proposed technical report on the Unicode site.
> >>document: http://www.unicode.org/unicode/reports/tr24/
>
> Good job! A very useful piece of information.
>
> But how does this combine with Normalization Forms?
>
> A brutal character-by-character application of the Script property from this file would achieve different results when the same grapheme is expressed in precomposed or decomposed form.
>
> E.g.: U+00C0 (LATIN CAPITAL LETTER A WITH GRAVE) is "script = Latin", i.e. the letter and its accent are in effect both Latin ("script = Latin, Latin"). However, the equivalent decomposed sequence U+0041, U+0300 (LATIN CAPITAL LETTER A, COMBINING GRAVE ACCENT) is "script = Latin, Common".
>
> To remove this ambiguity, why not assume that a combining character has the same script property as the base character it is applied to?
>
> This would, however, open the way to some tricky facets (although not necessarily wrong):
>
> - The "script" property of shared diacritics (e.g. U+0300 COMBINING GRAVE ACCENT) would be variable and context-dependent.
>
> - Script-specific combining marks could get assigned to a different script, if used in a strange context. E.g. U+093E (DEVANAGARI VOWEL SIGN AA) would be "script = Bengali" when following U+0995 (BENGALI LETTER KA).
>
> A different approach could be to assume a particular normalization (e.g. Normalization Form D), and remove all derivable characters from Script.txt.
>
> _ Marco


