Re: Unihan.txt and the four dictionary sorting algorithm

From: Raymond Mercier
Date: Tue Apr 20 2004 - 17:36:48 EDT


    John Jenkins writes:
    >> Also, even though the full Unihan database is 25+ Mb in size, given the
    >> cheapness of disk space nowadays, it's not all *that* big, surely.

    The problem of the size of Unihan has nothing at all to do with the cost of
    storage, and everything to do with the functioning of programs that might
    open and read it.
    Since the lines in Unihan are separated by 0x0A (LF) alone, not by the
    0x0D 0x0A (CR LF) pair that Windows programs expect, the line breaks are
    not shown when the file is opened in Notepad. Notepad does have the
    advantage that the UTF-8 encoding is recognized, and the characters are
    displayed.
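
The line-ending point can be checked and worked around with a few lines of script. What follows is only a minimal sketch: the function names `count_line_endings` and `lf_to_crlf` are made up for illustration, and the sample bytes are placeholder values, not real Unihan entries.

```python
import re


def count_line_endings(data: bytes) -> dict:
    """Count CR LF pairs, bare LF and bare CR in a byte string."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "lf": data.count(b"\n") - crlf,  # LF not preceded by CR
        "cr": data.count(b"\r") - crlf,  # CR not followed by LF
    }


def lf_to_crlf(data: bytes) -> bytes:
    """Turn every bare LF into CR LF, leaving existing CR LF pairs alone."""
    return re.sub(rb"(?<!\r)\n", b"\r\n", data)


# Placeholder lines in the tab-separated layout (hypothetical values).
sample = b"U+3400\tkExample\tvalue1\nU+3401\tkExample\tvalue2\n"
print(count_line_endings(sample))
print(count_line_endings(lf_to_crlf(sample)))
```

Because the conversion works on bytes, the UTF-8 content itself is untouched; only the 0x0A separators change. For a file of this size one would apply it chunk by chunk rather than to the whole file in memory.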

    If opened in WordPad the Chinese characters do not appear; perhaps WordPad
    does not recognize the UTF-8 encoding.

    If I try MS Word the machine grinds to a halt, and this is a good modern
    machine (XP with 120 GB HD and 512 MB RAM).

    Similarly, if I open it in IE6 with UTF-8 encoding, the text loads up to
    around U+4C00, and then the browser grinds to a halt.

    I can open it in the Hex Workshop byte editor, or in the editor in Visual
    C++ 6, but these do not recognize the UTF-8 encoding, and they hardly count
    as suitable readers for such a file.
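
A program that reads the file one line at a time, instead of loading and laying out the whole text, should not be troubled by its size. The following is a minimal sketch under stated assumptions: it assumes the tab-separated layout `U+XXXX<TAB>tag<TAB>value` with `#` comment lines, and the names `iter_entries` and `lookup` are hypothetical helpers, not part of any existing tool.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_entries(lines: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (code point, tag, value) for each data line, one line at a time."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t", 2)
        if len(parts) != 3 or not parts[0].startswith("U+"):
            continue  # skip anything that is not in the assumed layout
        try:
            code_point = int(parts[0][2:], 16)
        except ValueError:
            continue  # the part after "U+" was not hexadecimal
        yield code_point, parts[1], parts[2]


def lookup(path: str, code_point: int) -> List[Tuple[str, str]]:
    """Collect the (tag, value) pairs for one character, streaming the file."""
    # newline="" keeps the file's own line endings, LF or CR LF alike.
    with open(path, encoding="utf-8", newline="") as handle:
        return [(tag, value)
                for cp, tag, value in iter_entries(handle)
                if cp == code_point]
```

A call such as `lookup("Unihan.txt", 0x4E00)` would then scan the file once and hold only the matching lines in memory.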

    I wish the people who designed this file would accept the need for a more
    structured and sophisticated approach. Why not, for example, have a basic
    HTML file with links to the various sections?
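
To make the suggestion concrete, here is a minimal sketch of how such an index page might be generated. Everything in it is an assumption made for illustration: the sections are taken to be fixed ranges of 0x1000 code points, and the file naming scheme `unihan-XXXX.html` and the function names are hypothetical.

```python
from html import escape
from typing import Iterable


def section_start(code_point: int, size: int = 0x1000) -> int:
    """Return the first code point of the section containing code_point."""
    return code_point - code_point % size


def section_filename(start: int) -> str:
    """Hypothetical file name for the section beginning at start."""
    return f"unihan-{start:04X}.html"


def build_index(code_points: Iterable[int], size: int = 0x1000) -> str:
    """Build a basic HTML page with one link per non-empty section."""
    starts = sorted({section_start(cp, size) for cp in code_points})
    items = []
    for start in starts:
        label = f"U+{start:04X}..U+{start + size - 1:04X}"
        items.append(
            f'<li><a href="{section_filename(start)}">{escape(label)}</a></li>'
        )
    return "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>"


print(build_index([0x3400, 0x3401, 0x4E00]))
```

Each section file would then be small enough for an ordinary browser or editor to open, and the index page would stay tiny whatever the size of the database.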

    Raymond Mercier
