UnicodeData.txt questions

From: Chris Clark (Chris.Clark@ingres.com)
Date: Fri May 27 2011 - 12:09:16 CDT

Next message: Ken Whistler: "Re: UnicodeData.txt questions"

Previous message: announcements@unicode.org: "PRI #184: Proposed Update UTS #37, Unicode Ideographic Variation Database"
Next in thread: Ken Whistler: "Re: UnicodeData.txt questions"
Reply: Ken Whistler: "Re: UnicodeData.txt questions"

I've been looking at the version 6.0 UnicodeData.txt data file at
http://www.unicode.org/Public/UNIDATA/ and I can't find a
UnicodeData.html to go with it. For older versions there is an HTML
explanation file, e.g.
http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html

Is UnicodeData.txt described elsewhere now?

I'm finding the notation for ranges in UnicodeData.txt a little
non-intuitive, e.g. the omitted Hangul Syllables are represented by two entries:

AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;

Would it make more sense to have a single entry? Something along the
lines of:

AC00..D7A3;<RANGE: Hangul Syllables>;Lo;0;L;;;;;N;;;;;

A single line would be easier to detect and deal with when parsing the
file: there would be no need to carry processing state from one line to
the next.
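
To illustrate the state that the current two-line notation requires, here is a
minimal Python sketch of a reader that pairs each "<..., First>" line with the
following "<..., Last>" line. The helper name iter_ranges and the tuple it
yields are made up for illustration; they are not part of any Unicode tooling.

```python
def iter_ranges(lines):
    """Yield (first, last, name, other_fields) per UnicodeData.txt entry.

    Hypothetical helper: ordinary lines yield first == last, while a
    First/Last pair is merged into a single range.
    """
    pending = None  # state carried over from a "<..., First>" line
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code = int(fields[0], 16)
        name = fields[1]
        if name.endswith(", First>"):
            # Remember the start of the range and its label, e.g.
            # "<Hangul Syllable, First>" -> "Hangul Syllable"
            pending = (code, name[1:-len(", First>")])
        elif name.endswith(", Last>"):
            if pending is None:
                raise ValueError("Last without First: " + line)
            first, label = pending
            pending = None
            yield first, code, label, fields[2:]
        else:
            yield code, code, name, fields[2:]


sample = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;",
    "D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;",
]
for first, last, name, props in iter_ranges(sample):
    print("%04X..%04X %s (%s)" % (first, last, name, props[0]))
```

With a single-line range notation the `pending` variable and both
`endswith` branches would be unnecessary.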

http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html
explicitly lists the ranges of characters (which I find REALLY useful and
clear). It also mentions that CJK Ideographs and Hangul Syllables are
omitted as they can be easily derived, and it links to the Unicode Standard
and Unicode Standard Annex #15 (i.e. http://unicode.org/reports/tr15/).
I can find the Hangul algorithm at
http://unicode.org/reports/tr15/#Hangul but CJK Ideographs are not
covered there. I know the CJK algorithm is pretty obvious, but I was
expecting to see it explicitly detailed.

I went ahead and implemented Python versions of both, i.e. a Java-to-Python
port for Hangul and a new CJK name function. I'm not sure whether they are
of any use to anyone but me, but I thought I'd share them just in case; see
the end of this mail for the inline version. They were tested with Python 2.x
and Jython 2.5.2 (and will probably work with 3.x too).

Chris

class MyBaseException(Exception):
    '''Base exception'''

class IllegalArgumentException(MyBaseException):
    '''Java IllegalArgumentException'''

# Hangul constants
SBase = 0xAC00
#LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
LCount = 19
VCount = 21
TCount = 28
NCount = VCount * TCount # 588
SCount = LCount * NCount # 11172

JAMO_L_TABLE = [
    "G", "GG", "N", "D", "DD", "R", "M", "B", "BB",
    "S", "SS", "", "J", "JJ", "C", "K", "T", "P", "H"
]

JAMO_V_TABLE = [
    "A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O",
    "WA", "WAE", "OE", "YO", "U", "WEO", "WE", "WI",
    "YU", "EU", "YI", "I"
]

JAMO_T_TABLE = [
    "", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM",
    "LB", "LS", "LT", "LP", "LH", "M", "B", "BS",
    "S", "SS", "NG", "J", "C", "K", "T", "P", "H"
]

def getHangulName(single_unicode_character):
    """Python straight conversion of Java reference
    implementation getHangulName() from
    http://unicode.org/reports/tr15/#Hangul

    Non-pythonic, no change to names/code unless for syntax reasons

    Parameter:
        single_unicode_character - single Unicode character
    """
    # add assert unicode
    s = ord(single_unicode_character)
    SIndex = s - SBase
    if (0 > SIndex or SIndex >= SCount):
        # s is an int, so format it rather than concatenating it to a str
        raise IllegalArgumentException("Not a Hangul Syllable: U+%04X" % s)

    # Floor division keeps the indexes integral on Python 3 as well
    LIndex = SIndex // NCount
    VIndex = (SIndex % NCount) // TCount
    TIndex = SIndex % TCount
    return "HANGUL SYLLABLE " + JAMO_L_TABLE[LIndex] \
      + JAMO_V_TABLE[VIndex] + JAMO_T_TABLE[TIndex]

def getCJKName(single_unicode_character):
    """names and functionality based on
    implementation of getHangulName() from
    http://unicode.org/reports/tr15/#Hangul

    Parameter:
        single_unicode_character - single Unicode character
    """
    # add assert unicode
    s = ord(single_unicode_character)
    # U+4E00 .. U+9FA5
    if (s < 0x4E00 or s > 0x9FA5):
        # s is an int, so format it rather than concatenating it to a str
        raise IllegalArgumentException("Not a CJK Unified Ideograph: U+%04X" % s)

    # The derived name is the code point in uppercase hex
    return "CJK UNIFIED IDEOGRAPH-%04X" % s
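
As a cross-check on derived names like these, Python's standard unicodedata
module can be queried for the same code points. This is only a sketch: the
three code points are chosen for illustration, and the result depends on the
Unicode database bundled with the interpreter in use.

```python
import unicodedata

# First and last Hangul syllables, and the first CJK Unified Ideograph
# (code points picked for illustration only).
chars = (u"\uAC00", u"\uD7A3", u"\u4E00")
names = [unicodedata.name(ch) for ch in chars]

for ch, name in zip(chars, names):
    print("U+%04X %s" % (ord(ch), name))
# U+AC00 HANGUL SYLLABLE GA
# U+D7A3 HANGUL SYLLABLE HIH
# U+4E00 CJK UNIFIED IDEOGRAPH-4E00
```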


This archive was generated by hypermail 2.1.5 : Fri May 27 2011 - 13:09:06 CDT