Re: U+0F81 - Unicode 4.0 normalization error (missing exclusion for "Tibetan Vowel Sign Reversed II")

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue May 13 2003 - 05:58:59 EDT

Next message: Andrew C. West: "Re: visible glyphs for U+2062 and similar characters"

Previous message: John Clews: "ISO 6438, the Niamey keyboard, and ISO/TC46/SC4/WG1"
In reply to: Markus Scherer: "Re: U+0F81 - Unicode 4.0 normalization error (missing exclusion for "Tibetan Vowel Sign Reversed II")"
Next in thread: Markus Scherer: "Re: U+0F81 - Unicode 4.0 normalization error (missing exclusion for "Tibetan Vowel Sign Reversed II")"
Reply: Markus Scherer: "Re: U+0F81 - Unicode 4.0 normalization error (missing exclusion for "Tibetan Vowel Sign Reversed II")"

I know that the "Derived*.txt" files give this information, but all of these files are considered informative rather than normative, since they are generated from the base normative files and from the algorithms documented in the Unicode reference documents. A safe implementation should therefore be able to ignore the "Derived*.txt" files and compute the same data itself.

I may have missed something in the algorithm, because this is the only character in the Unicode 4 data files that exhibits this problem. I think I have made an implementation error and, after studying it, I think it is related to the combining class 0 of this character, which has a canonical decomposition into two combining characters of *distinct* combining classes, 129 and 130.
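The property values described above can be checked quickly. The following is a minimal sketch, assuming Python's standard `unicodedata` module; its data comes from a later Unicode version than 4.0, so treat it as an illustration (these particular values should be unchanged). The helper name `describe` is hypothetical.

```python
import unicodedata

def describe(ch):
    """Return (combining class, canonical decomposition as a list of code points)."""
    decomp = unicodedata.decomposition(ch)
    # Compatibility decompositions start with a <tag>; they are not canonical.
    if decomp.startswith("<"):
        canonical = []
    else:
        canonical = [int(cp, 16) for cp in decomp.split()]
    return unicodedata.combining(ch), canonical

ccc, parts = describe("\u0f81")
print(f"U+0F81 ccc={ccc}")
for cp in parts:
    # Each constituent is itself a combining mark with a non-zero class.
    print(f"  U+{cp:04X} ccc={unicodedata.combining(chr(cp))}")
```

The composed character reports class 0, while both constituents of its decomposition report non-zero classes.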

What is the justification for defining this character with CC=0? I think that defining it as 129 would not break the sorting of diacritics after a base letter, and it would avoid such implementation errors when building decomposition tables. So it seems that Unicode chose to give a CC to canonically decomposable characters, but this CC should be treated only as an informative hint, not as normative, for those characters. In my opinion the CC of ALL canonically decomposable characters should always be defined as the CC of the first character of its NFD decomposition, with the combining characters in the NFD string reordered by combining class, or simply as the minimum CC of its canonical constituents.

>> Shouldn't this character have a combining class 129
>> (i.e. the min CC of the canonically decomposed sequence)
>> to avoid it being considered as a starter ?

I will need to add another check to the parsing algorithm, to see whether the canonical decomposition of a character starts with a combining character (class > 0), in which case the CC of the composed character is ignored. But I think this just complicates the implementation of normalizer builder algorithms.
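Such a check could look like the sketch below. It is only an illustration, assuming Python's `unicodedata` module (with newer data than Unicode 4.0) and a paraphrase of the UAX #15 categories of algorithmically determinable exclusions (singleton decompositions and non-starter decompositions). The function names are hypothetical.

```python
import unicodedata

def canonical_decomposition(ch):
    """Return the one-level canonical decomposition of ch as a list of characters."""
    decomp = unicodedata.decomposition(ch)
    # An empty string means no decomposition; a leading <tag> means compatibility.
    if not decomp or decomp.startswith("<"):
        return []
    return [chr(int(cp, 16)) for cp in decomp.split()]

def is_algorithmically_excluded(ch):
    """True for singleton and non-starter decompositions (assumed definitions)."""
    parts = canonical_decomposition(ch)
    if not parts:
        return False
    if len(parts) == 1:
        return True  # singleton decomposition
    # Non-starter decomposition: the character itself, or the first character
    # of its decomposition, has a non-zero combining class.
    return unicodedata.combining(ch) != 0 or unicodedata.combining(parts[0]) != 0

for ch in ("\u0f81", "\u0344", "\u212b", "\u00e9"):
    print(f"U+{ord(ch):04X} excluded: {is_algorithmically_excluded(ch)}")
```

U+0344 (COMBINING GREEK DIALYTIKA TONOS) is included on the assumption that it is the kind of composed diacritic pair with a non-zero CC that is under discussion. Script-specific exclusions such as U+0958 are not detected this way; those are the ones that have to be listed explicitly in CompositionExclusions.txt.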

This character should behave like the Greek composed pair of diacritics, which HAS a non-zero CC defined as the min CC of its component diacritics (incidentally, this combination uses two diacritics that have the same CC, so the CC of the composed character is not ambiguous).

Also, I have found that some other recomposition algorithms (those that transform NFD to NFC, or NFKD to NFKC) do not implement the composition exclusion for this character. They allow this character pair to be recombined either because the combination has the starter class 0 (which some may take to mean that it can safely be used as a valid starter or isolated character), or because the result is canonically equivalent and produces a shorter string (the same length justification as for recombining some jamos to produce canonically equivalent but shorter Hangul "syllables").
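A recomposition implementation can be tested against the NormalizationTest.txt entry for this character with a few lines. This is a hedged sketch, assuming Python's `unicodedata.normalize` as the implementation under test; the helper name `nfc_hex` is hypothetical.

```python
import unicodedata

def nfc_hex(s):
    """Return the NFC form of s as a list of hexadecimal code points."""
    return [f"{ord(c):04X}" for c in unicodedata.normalize("NFC", s)]

# The decomposed pair must NOT be recombined into U+0F81.
print(nfc_hex("\u0f71\u0f80"))
# The precomposed character must be decomposed by NFC and stay decomposed.
print(nfc_hex("\u0f81"))
# Contrast: an ordinary pair (e + COMBINING ACUTE ACCENT) does recompose.
print(nfc_hex("e\u0301"))
```

An implementation that returns a single code point 0F81 for either of the first two inputs is recombining a pair that is excluded from composition.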

Note that the term "syllable" is quite inexact for the characters in the large Hangul block, because Hangul syllables may be longer than such character occurrences: they could include one or more leading L jamos before an LV or LVT "syllable"; also an LV "syllable" could be followed by one or more V jamos and an LVT "syllable" with an L filler...
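The point about jamo sequences that are longer than one precomposed character can be illustrated as follows. This is a small sketch, assuming Python's `unicodedata.normalize` and three arbitrarily chosen jamos (U+1100, U+1161, U+11A8); the names `L`, `V`, `T` and `nfc_hex` are hypothetical.

```python
import unicodedata

def nfc_hex(s):
    """Return the NFC form of s as a list of hexadecimal code points."""
    return [f"{ord(c):04X}" for c in unicodedata.normalize("NFC", s)]

L, V, T = "\u1100", "\u1161", "\u11a8"  # one leading, vowel and trailing jamo

print(nfc_hex(L + V))      # L V composes into one LV character
print(nfc_hex(L + V + T))  # L V T composes into one LVT character
print(nfc_hex(L + L + V))  # the extra leading L stays outside the LV character
print(nfc_hex(L + V + V))  # the extra V stays outside the LV character
```

In the last two cases the normalized string still contains a separate jamo next to the precomposed character, so the precomposed character does not cover the whole sequence.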

-- Philippe.
----- Original Message -----
From: "Markus Scherer" <markus.scherer@jtcsv.com>
To: "unicode" <unicode@unicode.org>
Sent: Monday, May 12, 2003 7:02 PM
Subject: Re: U+0F81 - Unicode 4.0 normalization error (missing exclusion for "Tibetan Vowel Sign Reversed II")

> Philippe Verdy wrote:
> >>After some tests I have seen that one character defined in the test file is
> >>excluded from canonical recomposition:
> >>
> >>This normalization test chart:
> >>http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
> >>lists:
> >>
> >>0F81; 0F71 0F80; 0F71 0F80; 0F71 0F80; 0F71 0F80 # (◌ཱྀ; ◌ཱ◌ྀ; ◌ཱ◌ྀ; ◌ཱ◌ྀ;
> >>◌ཱ◌ྀ; ) TIBETAN VOWEL SIGN REVERSED II
> >>
> >>However I don't know why it is not listed in
> >>http://www.unicode.org/Public/4.0-Update/CompositionExclusions-4.0.0.txt
>
> This is because CompositionExclusions.txt does not list all exclusions, but only those that are not
> algorithmically determinable. The Full_Composition_Exclusion property lists them all including
> U+0F81, in DerivedNormalizationProps.txt. See UAX #15 as well as UCD.html and the headers of the
> property files.
>
> Best regards,
> markus
>
> PS: The book is not published yet, but the Unicode 4 data files are final for about a month now.
>
>

This archive was generated by hypermail 2.1.5 : Tue May 13 2003 - 06:57:29 EDT