Re: Where is the First/Last convention documented?

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Sep 06 2007 - 15:35:44 CDT

    > This leads to another issue in the database format, which I prefer to
    > discuss here first: why are there ranges in UnicodeData.txt rather than
    > explicit records for every character?

    Several reasons.

    UnicodeData.txt was the very first data file. Its format was
    created ad hoc, somewhere in the 1993 timeframe, in support of
    the publication and implementation of Unicode 1.1, by developers
    at Taligent.

    It was released publicly with some other new data files for
    Unicode 2.0 in 1996.

    One of the things to note about that is that when the Hangul
    syllables were recoded between Unicode 1.1 and Unicode 2.0
    (and expanded from 6,656 to 11,172 in number), the UTC (and WG2,
    for that matter) made an explicit decision to create Hangul
    syllable names for the new set algorithmically. For the
    UTC the decompositions could also be done algorithmically.
    So it was a desired and *mandated* decision by the UTC to
    *remove* the explicit Hangul records from UnicodeData.txt and
    to note that those values were all algorithmically derived
    instead.
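
To illustrate what "algorithmically derived" means for the decompositions, here is a minimal Python sketch of the arithmetic Hangul syllable decomposition. The constants are the ones given in Section 3.12 of the Unicode Standard; the function name `decompose_hangul` is an illustrative assumption, not something taken from the data files.

```python
# Constants from the Hangul syllable decomposition algorithm
# (Unicode Standard, Section 3.12, "Conjoining Jamo Behavior").
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11,172 precomposed syllables


def decompose_hangul(cp):
    """Return the conjoining jamo code points for a Hangul syllable.

    Hypothetical helper name. Yields (L, V) for syllables without a
    trailing consonant and (L, V, T) otherwise.
    """
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError(f"U+{cp:04X} is not a precomposed Hangul syllable")
    lead = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trail = T_BASE + s_index % T_COUNT
    # A trailing index of 0 (trail == T_BASE) means "no trailing consonant".
    return (lead, vowel) if trail == T_BASE else (lead, vowel, trail)
```

Because every one of the 11,172 decompositions falls out of this arithmetic, explicit per-syllable records would carry no information beyond the range endpoints.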

    UnicodeData-1.1.5.txt didn't have the First/Last convention.
    Instead, it had a single entry for Han characters, to wit:

    4E00;<CJK IDEOGRAPH REPRESENTATIVE>;Lo;0;L;;;;;N;;;;;

    UnicodeData-2.0.14.txt, for Unicode 2.0, innovated the
    First/Last convention for CJK and other ranges, because it
    became apparent that there were other ranges to document
    besides the initial CJK URO, and because having the start
    and stop values for the range was important.
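
For anyone reading the file programmatically, the First/Last convention can be handled with a small amount of state. The following is a minimal sketch under stated assumptions, not a tool distributed with the UCD: the function name `iter_ranges` is hypothetical, and the `SAMPLE` records are illustrative lines modeled on the Unicode 2.0-era format rather than an excerpt from any particular release.

```python
# Illustrative records in UnicodeData.txt format (15 semicolon-separated fields).
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FA5;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
"""

FIRST_SUFFIX = ", First>"
LAST_SUFFIX = ", Last>"


def iter_ranges(lines):
    """Yield (start, end, label, general_category) per record or range.

    Ordinary records yield start == end; a First/Last pair is collapsed
    into one tuple covering the whole range.
    """
    pending = None  # (start code point, label) of an open First record
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.startswith("<") and name.endswith(FIRST_SUFFIX):
            pending = (cp, name[1:-len(FIRST_SUFFIX)])
        elif name.startswith("<") and name.endswith(LAST_SUFFIX):
            label = name[1:-len(LAST_SUFFIX)]
            if pending is None or pending[1] != label:
                raise ValueError(f"unmatched Last record at U+{cp:04X}")
            yield pending[0], cp, label, category
            pending = None
        else:
            yield cp, cp, name, category


if __name__ == "__main__":
    for start, end, label, category in iter_ranges(SAMPLE.splitlines()):
        print(f"{start:04X}..{end:04X} {category} {label} ({end - start + 1})")
```

Every code point inside a range shares the property values on the First and Last lines, so a consumer only needs the two endpoints.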

    As Eric hinted, back in the mid-1990s data size was more of
    an issue. With 70% of Unicode consisting of Han characters,
    and with the UnicodeData.txt values redundant across all
    of them, a simple database normalization decision that
    reduces all those records to a few range points was an
    obvious way to go.

    Furthermore, UnicodeData.txt all along has been maintained
    with fairly simple tools and diffs. Unlike Unihan.txt, which
    is actually a report generated from a relational database,
    UnicodeData.txt *is* the data source itself. It has had to
    be meticulously maintained in many, many deltas going back
    over a decade now, and having all of those versions bloated
    with massive amounts of redundant CJK and Hangul records that
    never change would simply have been inefficient and useless.

    > Being explicit would avoid
    > generating names for the implicit records (something which is not
    > obvious and not well documented, IMHO).

    Well, if you go to ISO/IEC 10646, clause 28 is "Character names
    and annotations", and in that clause, subclause 28.2 "Character
    names for CJK Ideographs" gives the rules for naming of
    CJK unified and compatibility ideographs, and subclause 28.3
    "Character names and annotations for Hangul syllables" does
    the same for Hangul syllables. It is not as if anyone who
    reads the standard could miss it.

    The Unicode Standard (by necessity) follows the same rules,
    and documents them in Chapter 17 "Code Charts", with Section 17.2
    "CJK Unified Ideographs" spelling out the CJK rule and
    Section 17.3 "Hangul Syllables" noting the Hangul name rule
    and pointing to Section 3.12 "Conjoining Jamo Behavior" for
    the details of the algorithm.
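
The two naming rules cited above are short enough to sketch in Python. This is a hedged illustration, not normative text: the function name `derived_name` is hypothetical, the `CJK_UNIFIED_RANGES` list is an assumption that covers only two example ranges (a real implementation would take the ranges from the First/Last records of the UnicodeData.txt version in use), and compatibility ideographs are left out. The jamo short names are those used by the Hangul syllable name rule.

```python
S_BASE = 0xAC00
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588
S_COUNT = 19 * N_COUNT        # 11,172

# Jamo short names, indexed by leading, vowel, and trailing jamo position.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
          "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
          "K", "T", "P", "H"]

# Assumption: example ranges only; read the real ones from the data file.
CJK_UNIFIED_RANGES = [(0x3400, 0x4DB5), (0x4E00, 0x9FA5)]


def derived_name(cp):
    """Return the rule-derived name for cp, or None if no rule applies."""
    s_index = cp - S_BASE
    if 0 <= s_index < S_COUNT:
        lead = s_index // N_COUNT
        vowel = (s_index % N_COUNT) // T_COUNT
        trail = s_index % T_COUNT
        return "HANGUL SYLLABLE " + JAMO_L[lead] + JAMO_V[vowel] + JAMO_T[trail]
    for start, end in CJK_UNIFIED_RANGES:
        if start <= cp <= end:
            # The name is the fixed prefix plus the code point in uppercase hex.
            return f"CJK UNIFIED IDEOGRAPH-{cp:04X}"
    return None
```

For code points inside the ranges it covers, the result can be cross-checked against Python's standard `unicodedata.name()`, which applies the same rules.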

    But I grant that perhaps this is not obvious to the
    casual observer of Unicode, as opposed to folks who have
    been working on the standard for years.

    Perhaps sticking something in the FAQ on Chinese, Japanese
    and Korean issues would help:

    http://www.unicode.org/faq/han_cjk.html

    Or perhaps a FAQ just dedicated to Unicode character names
    might be in order. In any case, make specific suggestions,
    and perhaps the situation can be improved.

    >
    > Or, a variant, why not a DerivedUnicodeData.txt file with
    > all the characters?

    Eric Muller pointed out the ultimate answer to this, which
    is a fully rationalized and complete XML representation of
    *all* of the Unicode Character Database.

    Note, however, as regards names in particular, that some
    Unicode characters (e.g., noncharacters, private-use characters) don't
    have character names, so any notion of simply expanding the
    conventions of UnicodeData to all assigned code points gets
    you into the same kind of trouble that including control
    codes in the current UnicodeData.txt does -- you may give
    up one set of arbitrary conventions (First/Last range
    compression), but end up having to invent other arbitrary
    conventions for special cases.

    --Ken


