Re: per-character "stories" in a database (derives from Re: geometric shapes)

From: William Overington (
Date: Fri Mar 14 2003 - 11:08:07 EST

    Markus Scherer wrote as follows.


    It has been suggested many times to build a database (list, document, XML,
    ...) where each designated/assigned code point and each character gets its
    "story": Comments on the glyphs, from what codepage it was inherited, usage
    comments and examples, alternate names, etc.

    I am talking about both code points and "characters" on purpose, and I
    would go a step beyond documenting what's there. All the "characters" that
    can be represented by a sequence of assigned Unicode characters should be
    listed, with that sequence (or those sequences), and with further
    explanation if necessary.

    end quote
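
    As a rough sketch only, one entry in such a database might look something
    like the following. The class name and every field name are hypothetical,
    chosen just to mirror the items listed in the quoted text. Nothing here
    comes from the Unicode Standard itself, apart from the character names
    looked up through Python's unicodedata module. The example sequence is a
    Devanagari one of the kind a font would typically render as a conjunct.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import unicodedata


@dataclass
class CharacterStory:
    """One hypothetical database entry; all field names are illustrative."""
    sequence: Tuple[int, ...]      # one code point, or a sequence of them
    glyph_comments: str = ""       # comments on the glyphs
    inherited_from: str = ""       # codepage the character came from, if known
    usage_notes: str = ""          # usage comments and examples
    alternate_names: List[str] = field(default_factory=list)

    def label(self) -> str:
        """Render the sequence in the usual U+XXXX notation."""
        return " ".join(f"U+{cp:04X}" for cp in self.sequence)

    def names(self) -> List[str]:
        """Look up the formal character name of each code point."""
        return [unicodedata.name(chr(cp), "<unnamed>") for cp in self.sequence]


# A "character" represented by a sequence of assigned Unicode characters.
story = CharacterStory(
    sequence=(0x0915, 0x094D, 0x0937),
    usage_notes="Placeholder text; a real entry would be written by contributors.",
)
print(story.label())
print(story.names())
```

    Keeping the sequence as the key, rather than a single code point, is one
    way to cover both the code points and the "characters" in one list.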

    Yes, that is a very good point. I have become interested in the languages
    of the Indian subcontinent from the standpoint of trying to ensure that they
    can be displayed properly on interactive television using portable font
    technology. I am not a linguist, however, and I find it strange that the
    Unicode Standard does not codify the ligatures which can be produced at
    display time, for the languages of the Indian subcontinent, from specific
    sequences of regular Unicode characters, so that someone skilled in the art
    of font design could design a font from the code charts.

    Later he wrote.


    Now we just need to
    - find someone to sponsor this effort technically and with humanpower
    - squeeze the existing information out of the standard, the mailing lists,
    FAQs, and of course out of the Unicode veterans before they retire by
    Unicode 6...

    end quote

    Well, how about an approach like the one Project Gutenberg uses for
    proofreading transcripts of classic books? If there were a database where
    people could
    post items about particular characters and people could read them and either
    confirm what is said or put some other view or just add some other
    information, then maybe the database could just sort of gradually become
    generated over a period of years. How big would that be? At about 100
    thousand code points, with, say, 200 words for each on average, and about
    5 or 6 characters per word on average plus a space following each word,
    it would be about 130 megabytes in total, taking one byte for each
    character. I fully realize that the phrase "sort of
    gradually" might easily be quoted in a response to this posting, yet if the
    database facility were there, accessible directly from the web, there may
    well be many people who would stop by for a while and review what has been
    entered and add a little more to the database.
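
    The size estimate above can be checked with a few lines of Python. This is
    only a back-of-envelope sketch: the function name is made up for the
    purpose, 5.5 is taken as the midpoint of "5 or 6" characters per word, and
    one byte per character is assumed, which would hold for plain English text
    but not for entries that quote the characters themselves.

```python
CODE_POINTS = 100_000       # "about 100 thousand code points"
WORDS_PER_ENTRY = 200       # "say, 200 words for each on average"
CHARS_PER_WORD = 5.5        # assumed midpoint of "5 or 6 characters per word"
BYTES_PER_CHAR = 1          # assumption: plain one-byte-per-character text


def estimated_bytes(chars_per_word: float = CHARS_PER_WORD) -> float:
    """Estimate total size: each word is followed by one space."""
    chars_per_entry = WORDS_PER_ENTRY * (chars_per_word + 1)
    return CODE_POINTS * chars_per_entry * BYTES_PER_CHAR


for cpw in (5, 5.5, 6):
    megabytes = estimated_bytes(cpw) / 1_000_000
    print(f"{cpw} characters per word: about {megabytes:.0f} megabytes")
```

    So the figure lies somewhere between 120 and 140 megabytes, depending on
    which end of "5 or 6" is taken.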

    >PS: Sorry, I am not in a position to volunteer...

    Well, it could be more of an informal thing. If the facility were set up,
    then people who are interested could simply visit the web site when they
    felt like participating. Certainly there might be a core of people who had
    the ability to throw out rubbish and to convert fragments of text into a
    good English narrative so that there was some overall structure to it all,
    yet it does not necessarily need to be as formal and rigid as if it were a
    commercial project with a time deadline, particularly if the alternative is
    that it does not get done at all.

    William Overington

    14 March 2003
