UnicodeData.txt questions

From: Chris Clark (Chris.Clark@ingres.com)
Date: Fri May 27 2011 - 12:09:16 CDT

    I've been looking at the version 6.0 UnicodeData.txt data file at
    http://www.unicode.org/Public/UNIDATA/ and I can't find a
    UnicodeData.html to go with it. For older versions there is an HTML
    explanation file, e.g.

    Is UnicodeData.txt described elsewhere now?

    I'm finding the notation for ranges in UnicodeData.txt a little
    non-intuitive, e.g. the omitted Hangul Syllables have 2 entries:

        AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
        D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;

    Would it make more sense to have a single entry? Something along the
    lines of:

        AC00..D7A3;<RANGE: Hangul Syllables>;Lo;0;L;;;;;N;;;;;

    A single line would be easier to detect and deal with when parsing the
    file, with no need to maintain processing state between lines.
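
    To illustrate the state that the current two-line notation requires, here
    is a rough sketch of a reader that pairs up the First/Last lines. The
    helper name iter_unicode_data() and the tuple layout it yields are
    assumptions made for this example only; they are not part of any Unicode
    tooling.

```python
def iter_unicode_data(lines):
    """Yield (first_cp, last_cp, name, other_fields) for each entry.

    Hypothetical helper for illustration. Ordinary lines yield
    first_cp == last_cp. A "<X, First>" line is held back until the
    matching "<X, Last>" line arrives, then both are yielded as one
    range named X.
    """
    first_suffix = ", First>"
    last_suffix = ", Last>"
    pending = None  # (start code point, range name) of an open First line
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        name = fields[1]
        if name.startswith("<") and name.endswith(first_suffix):
            # remember the start of the range; nothing to yield yet
            pending = (cp, name[1:-len(first_suffix)])
        elif name.startswith("<") and name.endswith(last_suffix):
            range_name = name[1:-len(last_suffix)]
            if pending is None or pending[1] != range_name:
                raise ValueError("unmatched range end: " + line)
            yield pending[0], cp, range_name, fields[2:]
            pending = None
        else:
            if pending is not None:
                raise ValueError("range start without end before: " + line)
            yield cp, cp, name, fields[2:]
    if pending is not None:
        raise ValueError("range start without end at end of input")


sample = [
    "AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;",
    "D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;",
]
ranges = list(iter_unicode_data(sample))
```

    The only state carried between lines is the pending variable, which is
    exactly the bookkeeping a single-line range entry would make unnecessary.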

    http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html does
    explicitly list the ranges of characters (which I find REALLY useful and
    clear). It also mentions that CJK Ideographs and Hangul Syllables are
    omitted as they can be easily derived. It then links to the Unicode
    Standard and Unicode Standard Annex #15 (i.e. http://unicode.org/reports/tr15/).
    I can find the Hangul algorithm at
    http://unicode.org/reports/tr15/#Hangul but CJK Ideographs are not
    covered. I know this is a pretty obvious algorithm but I was expecting
    to see it explicitly detailed.
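
    As a point of reference for the derived names, here is a small sketch
    that asks Python's standard unicodedata module for the names at both ends
    of each range. It assumes a Python whose unicodedata module derives these
    names itself (CPython does); the variable names are just for this example.

```python
import unicodedata

# First and last code points of the two ranges discussed above.
samples = [u"\uAC00", u"\uD7A3", u"\u4E00", u"\u9FA5"]

# unicodedata.name() raises ValueError for characters without a name.
names = [unicodedata.name(ch) for ch in samples]

for ch, name in zip(samples, names):
    print("U+%04X %s" % (ord(ch), name))
```

    The CJK names are simply the fixed prefix followed by the code point in
    uppercase hexadecimal, e.g. CJK UNIFIED IDEOGRAPH-4E00.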

    I went ahead and implemented Python versions of both, i.e. a java->python
    port for Hangul and a new CJK name function. I'm not sure if they are any
    use to anyone but me, but I thought I'd share them just in case; see the
    end of this mail for the inline version. It was tested with Python 2.x and
    Jython 2.5.2 (and it will probably work with 3.x too).


    class MyBaseException(Exception):
        '''Base exception'''

    class IllegalArgumentException(MyBaseException):
        '''Java IllegalArgumentException'''

    # Hangul constants
    SBase = 0xAC00
    #LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
    LCount = 19
    VCount = 21
    TCount = 28
    NCount = VCount * TCount # 588
    SCount = LCount * NCount # 11172

    JAMO_L_TABLE = [
        "G", "GG", "N", "D", "DD", "R", "M", "B", "BB",
        "S", "SS", "", "J", "JJ", "C", "K", "T", "P", "H"
    ]

    JAMO_V_TABLE = [
        "A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O",
        "WA", "WAE", "OE", "YO", "U", "WEO", "WE", "WI",
        "YU", "EU", "YI", "I"
    ]

    JAMO_T_TABLE = [
        "", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM",
        "LB", "LS", "LT", "LP", "LH", "M", "B", "BS",
        "S", "SS", "NG", "J", "C", "K", "T", "P", "H"
    ]

    def getHangulName(single_unicode_character):
        """Python straight conversion of Java reference
        implementation getHangulName() from
        Non-pythonic, no change to names/code unless for syntax reasons
            single_unicode_character - single Unicode character
        """
        # add assert unicode
        s = ord(single_unicode_character)
        SIndex = s - SBase;
        if (0 > SIndex or SIndex >= SCount):
            # s is an int, so format it rather than concatenating
            raise IllegalArgumentException("Not a Hangul Syllable: U+%04X" % s);
        # floor division keeps the indices integral on Python 3 as well
        LIndex = SIndex // NCount;
        VIndex = (SIndex % NCount) // TCount;
        TIndex = SIndex % TCount;
        return "HANGUL SYLLABLE " + JAMO_L_TABLE[LIndex] \
          + JAMO_V_TABLE[VIndex] + JAMO_T_TABLE[TIndex];

    def getCJKName(single_unicode_character):
        """names and functionality based on
        implementation of getHangulName() from
            single_unicode_character - single Unicode character
        """
        # add assert unicode
        s = ord(single_unicode_character)
        # U+4E00 .. U+9FA5 (inclusive)
        if not (0x4E00 <= s <= 0x9FA5):
            # s is an int, so format it rather than concatenating
            raise IllegalArgumentException("Not a CJK Unified Ideograph: U+%04X" % s);
        # character names use uppercase hex digits
        return "CJK UNIFIED IDEOGRAPH-%04X" % s
