Re: UnicodeData.txt questions

From: Ken Whistler (kenw@sybase.com)
Date: Fri May 27 2011 - 13:47:23 CDT


    On 5/27/2011 10:09 AM, Chris Clark wrote:
    > I've been looking at the version 6.0 UnicodeData.txt data file at
    > http://www.unicode.org/Public/UNIDATA/ and I can't find a
    > UnicodeData.html to go with it. For older versions there is a html
    > explanation file, e.g.
    > http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html
    >
    > Is UnicodeData.txt described elsewhere now?

    You're a couple generations behind. UnicodeData.html was replaced by
    UCD.html for several versions.

    Now, the documentation about UnicodeData.txt (and the rest of the data
    files of the Unicode Character Database (UCD)) is gathered in UAX #44:

    http://www.unicode.org/reports/tr44/

    When looking for the documentation about any particular version of the UCD,
    whether current or earlier, always start from the component listing for
    that version. The component listings give explicit links to the
    documentation for each version. Start from:

    http://www.unicode.org/versions/enumeratedversions.html

    which is also accessible from the home page on the link "Archive of Unicode
    Versions" in the menus.

    >
    > I'm finding the notation for ranges in UnicodeData.txt a little
    > non-intuitive, e.g. the omitted Hangul Syllables has 2 entries:
    >
    > AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
    > D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
    >
    > Would it make more sense to have a single entry? Something along the
    > lines of:
    >
    > AC00..D7A3;<RANGE: Hangul Syllables>;Lo;0;L;;;;;N;;;;;
    >
    > A single line would be easier to detect and deal with when parsing the
    > file. No need to maintain processing state between each line.

    That existing notation is a bit awkward to parse, but is left that way
    in part because it has *always* been that way. Changing it to accommodate
    some new parsers would just break old parsers.
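
    For what it's worth, the state a parser has to carry for the First/Last
    pairs is small. Below is a minimal Python sketch, not a reference
    implementation: the function name iter_entries and the tuple layout it
    yields are made up for illustration, and the sample uses the two Hangul
    lines quoted above plus the entry for U+0041.

```python
def iter_entries(lines):
    """Yield (first_cp, last_cp, name, other_fields) per logical entry.

    Ordinary lines yield first_cp == last_cp. A "<..., First>" line is
    held back until the matching "<..., Last>" line arrives, and the pair
    is then yielded as one range. (Hypothetical helper, for illustration.)
    """
    pending = None  # (first code point, range label) of an open range
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        name = fields[1]
        if name.endswith(", First>"):
            if pending is not None:
                raise ValueError("nested range at %04X" % cp)
            # Strip the leading "<" and the trailing ", First>".
            pending = (cp, name[1:-len(", First>")])
        elif name.endswith(", Last>"):
            if pending is None:
                raise ValueError("Last without First at %04X" % cp)
            first, label = pending
            pending = None
            yield first, cp, label, fields[2:]
        else:
            if pending is not None:
                raise ValueError("First without Last before %04X" % cp)
            yield cp, cp, name, fields[2:]


SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
"""

entries = list(iter_entries(SAMPLE.splitlines()))
for first, last, name, rest in entries:
    print("%04X..%04X %s gc=%s" % (first, last, name, rest[0]))
```

    The only thing remembered between lines is the start of the currently
    open range, so old and new parsers can both live with the existing
    notation.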

    >
    > http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html does
    > explicitly list the ranges of characters (which I find REALLY useful
    > and clear), it also mentions that CJK Ideographs and Hangul Syllables
    > are omitted as they can be easily derived. It then links to Unicode
    > Standard and Unicode Standard Annex #15 (i.e.
    > http://unicode.org/reports/tr15/). I can find the Hangul algorithm at
    > http://unicode.org/reports/tr15/#Hangul but CJK Ideographs are not
    > covered. I know this is a pretty obvious algorithm but I was expecting
    > to see it explicitly detailed.

    See UAX #44 for current information.
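
    As a rough illustration of the "pretty obvious" CJK rule: the name of a
    character in a CJK Unified Ideograph range is the fixed prefix
    "CJK UNIFIED IDEOGRAPH-" followed by the code point in uppercase hex.
    The Python sketch below assumes the caller has already checked that the
    code point falls inside one of the "<CJK Ideograph..., First/Last>"
    ranges of the UnicodeData.txt version in use; the function name is
    hypothetical, and UAX #44 and the Unicode Standard remain the
    authoritative statement of the rule.

```python
import unicodedata


def derived_cjk_name(cp):
    """Derive the name of a CJK unified ideograph from its code point.

    Assumes cp lies inside a "<CJK Ideograph..., First/Last>" range;
    no range check is done here. (Hypothetical helper, for illustration.)
    """
    return "CJK UNIFIED IDEOGRAPH-%04X" % cp


# Cross-check against the names Python's own unicodedata module reports.
for cp in (0x4E00, 0x20000):
    assert derived_cjk_name(cp) == unicodedata.name(chr(cp))
    print(derived_cjk_name(cp))
```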

    The explicit ranges of characters defined by First/Last pairs in
    UnicodeData.txt are not listed in UAX #44, but they are trivially
    derivable from UnicodeData.txt itself:

    % grep First UnicodeData.txt
    % grep Last UnicodeData.txt

    will get you all of them for any particular version of UnicodeData.txt.

    --Ken


