Re: U+0F81 - Unicode 4.0 normalization error (missing exclusion for "Tibetan Vowel Sign Reversed II")

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Tue May 13 2003 - 11:29:29 EDT


    Philippe Verdy wrote:
    > I know that the "Derived*.txt" files give some info, but all these files are considered informative and not normative, as they are produced from the base normative files and algorithms documented in the Unicode reference documents. So a safe implementation could ignore all these "Derived*.txt" files and should be able to compute the same set of files.

    This is true. However,
    - some of the derivations are complicated
    - some of the derivation formulas change from one Unicode version to another,
       so if you derive such properties yourself, you will have to change your
       code, while you could just parse the computed results
    - why reinvent the wheel if someone already derived those values for you,
       and maintains their changes?

    Don't get me wrong: when I started parsing UCD files a few years ago, I also took pride in avoiding
    the Derived* files. Over time, enough things in the UCD have changed (especially formulas) that I
    found it to be a maintenance hassle, and switched to just reading the Derived* files unless the
    property is derived truly trivially.
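
As a rough illustration of "just parse the computed results", here is a minimal Python sketch that collects the code points listed for one property in a Derived* file. The function name `parse_derived_props` is hypothetical, and the two sample lines are only written in the style of DerivedNormalizationProps.txt (code point or range, semicolon, property name, `#` comment); they are not copied from a particular UCD release.

```python
def parse_derived_props(lines, wanted):
    """Collect the code points listed for the property `wanted`.

    Assumes each data line looks like
        "XXXX[..YYYY] ; Property_Name # comment"
    which is the layout assumed here for the Derived* files.
    """
    code_points = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank line or comment-only line
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != wanted:
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point or range
        code_points.update(range(start, end + 1))
    return code_points


# Illustrative sample lines (assumed layout, not quoted from a release).
SAMPLE = [
    "# Derived Property: Full_Composition_Exclusion",
    "0340..0341    ; Full_Composition_Exclusion # Mn   [2] COMBINING GRAVE TONE MARK..COMBINING ACUTE TONE MARK",
    "0F81          ; Full_Composition_Exclusion # Mn       TIBETAN VOWEL SIGN REVERSED II",
]

excluded = parse_derived_props(SAMPLE, "Full_Composition_Exclusion")
```

With a parser like this, a change in the derivation formula between Unicode versions only requires downloading the new file, not changing code.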

    > I may have missed something in the algorithm, because this is the only character in the Unicode4 datafiles that exhibits this problem. ...

    Well, the formula is:
    # Derived Property: Full_Composition_Exclusion
    # Generated from: Composition Exclusions + Singletons + Non-Starter Decompositions

    If you compare the Full_Composition_Exclusion property listing with the list in
    CompositionExclusions.txt, you will find many more differences between the two than just U+0F81:
    the singletons and non-starter decompositions are added by the derivation.
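
The formula can be sketched in Python on top of the standard `unicodedata` module. This is only an approximation under stated assumptions: the helper names are hypothetical, the explicit list from CompositionExclusions.txt is not available in `unicodedata` and has to be passed in by the caller, and the results reflect whatever Unicode version the Python build ships. A non-starter decomposition is taken here to mean an expanding canonical decomposition where the character itself, or the first character of its decomposition, has a non-zero combining class.

```python
import unicodedata


def canonical_decomposition(ch):
    """Return the canonical decomposition mapping as a list of characters.

    Compatibility mappings (tagged like "<compat>") and characters without
    a mapping yield an empty list.
    """
    field = unicodedata.decomposition(ch)
    if not field or field.startswith("<"):
        return []
    return [chr(int(cp, 16)) for cp in field.split()]


def is_singleton(ch):
    """Canonical decomposition to exactly one character."""
    return len(canonical_decomposition(ch)) == 1


def is_non_starter_decomposition(ch):
    """Expanding canonical decomposition where the character itself, or the
    first character of its decomposition, is not a starter (ccc != 0)."""
    mapping = canonical_decomposition(ch)
    if len(mapping) < 2:
        return False
    return (unicodedata.combining(ch) != 0
            or unicodedata.combining(mapping[0]) != 0)


def is_full_composition_exclusion(ch, explicit_exclusions=frozenset()):
    """Composition Exclusions + Singletons + Non-Starter Decompositions.

    `explicit_exclusions` stands in for the list in CompositionExclusions.txt,
    which the caller must supply.
    """
    return (ch in explicit_exclusions
            or is_singleton(ch)
            or is_non_starter_decomposition(ch))
```

For U+0F81 the decomposition is U+0F71 U+0F80, and U+0F71 has a non-zero combining class, so the character falls under the non-starter decompositions even though U+0F81 itself has combining class 0.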

    > Also, I have found that some other recombining algorithms (that transform an NFD to NFC, or NFKD to NFKC), are not implementing the combining exclusion for this character, and allow recombining this character pair because the combination has a starter class 0 (which some may think that it can safely be used as a valid starter or isolated character), or because this is canonically equivalent and produces a shorter string (with the same length justification as recombining some Jamos to produce canonically equivalent but shorter Hangul "syllables")

    If an implementation does not do what the Unicode Normalization Forms specify, then it does not
    implement NFC or NFKC etc. at all. There is a clear specification in UAX #15, there is a sample
    implementation, and there is a conformance test file.
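
To make the conformance point concrete, here is a small Python sketch that checks the NFC invariants for one line in the five-column layout of the conformance test file (source; NFC; NFD; NFKC; NFKD). The helper names are hypothetical, the sample line is written for illustration rather than quoted from the file, and `recomposing_nfc` is a deliberately wrong normalizer that recombines the excluded pair, as the implementations described above do.

```python
import unicodedata


def parse_field(field):
    """Turn a field like "0F71 0F80" into the corresponding string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def check_nfc(line, to_nfc):
    """Check the NFC invariants for one test line:
    c2 == NFC(c1) == NFC(c2) == NFC(c3) and c4 == NFC(c4) == NFC(c5)."""
    data = line.split("#", 1)[0]  # drop the trailing comment
    c1, c2, c3, c4, c5 = [parse_field(f) for f in data.split(";")[:5]]
    return (all(to_nfc(c) == c2 for c in (c1, c2, c3))
            and all(to_nfc(c) == c4 for c in (c4, c5)))


def stdlib_nfc(s):
    """NFC as implemented by Python's unicodedata module."""
    return unicodedata.normalize("NFC", s)


def recomposing_nfc(s):
    """Deliberately non-conformant: recombines the excluded pair."""
    return stdlib_nfc(s).replace("\u0f71\u0f80", "\u0f81")


# Illustrative line for U+0F81 in the assumed five-column layout.
LINE = "0F81;0F71 0F80;0F71 0F80;0F71 0F80;0F71 0F80; # TIBETAN VOWEL SIGN REVERSED II"

conformant = check_nfc(LINE, stdlib_nfc)       # True
broken = check_nfc(LINE, recomposing_nfc)      # False
```

An implementation that recombines U+0F71 U+0F80 into U+0F81 fails this line, which is exactly the kind of deviation the test file is meant to catch.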

    > Note that the term "syllable" is quite inexact for characters in the large Hangul block, because Hangul syllables may be longer than such character occurrences and could include one or more leading L jamos before an LV or LVT "syllable"; also an LV "syllable" could be followed by one or more V jamos and an LVT "syllable" with an L filler...

    This is well known. The Hangul syllable characters are of course syllables, but not the only ones
    possible in Korean. No problem there.
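
A short Python sketch of this, using the standard `unicodedata` module (the helper name `nfc_code_points` is hypothetical): NFC composes an L+V or L+V+T jamo sequence into a single precomposed syllable character, while extra leading or trailing jamos simply stay next to it as separate characters.

```python
import unicodedata


def nfc_code_points(s):
    """Return the NFC form of `s` as a list of hex code point strings."""
    return [f"{ord(ch):04X}" for ch in unicodedata.normalize("NFC", s)]


lv = nfc_code_points("\u1100\u1161")             # ['AC00']
lvt = nfc_code_points("\u1100\u1161\u11a8")      # ['AC01']
extra_l = nfc_code_points("\u1100\u1100\u1161")  # ['1100', 'AC00']
extra_v = nfc_code_points("\uac00\u1161")        # ['AC00', '1161']
```

The last two results are still canonical NFC text; the syllable simply spans more than one character.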

    Best regards,
    markus


