RE: Finite state machines? UTF8: toFold(), normalisation, etc

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Mon May 05 2003 - 12:28:10 EDT


    Theodore H. Smith wrote:
    > For example, a good finite state machine for my use, could
    > extract some kind of about ASCII from the Unicode databases, that
    > to lowercase something, you add 32 but only if the char is from
    > 65 to 97.

    There are several other runs in Latin, Greek, Cyrillic and other alphabets
    that can be upper/lower-cased by adding a constant. You can encode each one
    of these runs with just 5 bytes:

            - Code of first character -- 3 bytes unsigned
              (e.g., for ASCII upper-casing: 0, 0, 97);

            - Length of run -- 1 byte unsigned
              (e.g., for ASCII upper-casing: 26);

            - Constant to add -- 1 byte signed
              (e.g., for ASCII upper-casing: -32).
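
    As a rough sketch of the 5-byte record described above (in Python, just
    for illustration): the names pack_run, unpack_run, UPPER_RUNS and
    to_upper_char are made up for this example, and the big-endian byte
    order is an assumption taken from the "0, 0, 97" example. The Greek and
    Cyrillic entries are only a sample, not a complete table.

```python
import struct

def pack_run(first, length, delta):
    """Pack one run as 5 bytes: 3-byte first code, 1-byte length, 1-byte signed delta."""
    return first.to_bytes(3, "big") + struct.pack(">Bb", length, delta)

def unpack_run(record):
    """Inverse of pack_run: return (first, length, delta)."""
    first = int.from_bytes(record[:3], "big")
    length, delta = struct.unpack(">Bb", record[3:5])
    return first, length, delta

# A few upper-casing runs (illustrative sample only).
UPPER_RUNS = [
    pack_run(0x0061, 26, -32),  # a..z -> A..Z
    pack_run(0x03B1, 17, -32),  # Greek alpha..rho -> Alpha..Rho
    pack_run(0x0430, 32, -32),  # Cyrillic a..ya -> A..YA
]

def to_upper_char(cp, runs=UPPER_RUNS):
    """Upper-case one code point using the run table; unknown codes pass through."""
    for record in runs:
        first, length, delta = unpack_run(record)
        if first <= cp < first + length:
            return cp + delta
    return cp
```

    The loop is a linear scan to keep the sketch short; with the runs sorted
    by first code, a binary search would do the same job faster.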

    If your search key also includes a mask to test for odd/even codes, this can
    also work for several runs in Latin Extended-A/B and Latin Extended
    Additional, where each capital is followed by its minuscule (in this
    case, the constant to add is always +1 or -1, so you just need a one-bit
    flag for storing the sign). Greek Extended mostly pairs letters eight code
    points apart, so it fits the constant-offset runs instead.
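
    A minimal sketch of the odd/even idea, assuming each run stores one bit
    saying whether its capitals sit on even or odd codes. The table name
    ALTERNATING_RUNS and the helper functions are hypothetical, and only a
    few runs are listed.

```python
# (first, last, capital_parity): capital_parity is (code & 1) for the capitals
# in that run. Illustrative sample only.
ALTERNATING_RUNS = [
    (0x0100, 0x012F, 0),  # Latin Extended-A: capitals on even codes
    (0x0139, 0x0148, 1),  # Latin Extended-A: capitals on odd codes
    (0x014A, 0x0177, 0),  # Latin Extended-A: capitals on even codes
    (0x1E00, 0x1E95, 0),  # Latin Extended Additional
]

def capital_parity(cp):
    """Return the capital parity bit of the run containing cp, or None."""
    for first, last, parity in ALTERNATING_RUNS:
        if first <= cp <= last:
            return parity
    return None

def to_lower_alt(cp):
    """Capital -> following minuscule (+1); anything else is unchanged."""
    parity = capital_parity(cp)
    if parity is not None and (cp & 1) == parity:
        return cp + 1
    return cp

def to_upper_alt(cp):
    """Minuscule -> preceding capital (-1); anything else is unchanged."""
    parity = capital_parity(cp)
    if parity is not None and (cp & 1) != parity:
        return cp - 1
    return cp
```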

    However, notice that not all case conversions can be done with this
    technique, so you should plan for a slower but more general mechanism
    for edge cases. E.g.:

            - a single character corresponds to a *string* of characters in
              the other case
              (e.g.: "ß" [U+00DF] -> "SS" [U+0053 U+0053]);

            - the other-case equivalent depends on language
              (e.g.: "i" [U+0069] -> "İ" [U+0130] in Turkish, but "I" [U+0049]
              in other languages);

            - the other-case equivalent depends on the document's age
              (e.g.: all Georgian lower-case letters: they have an upper-case
              equivalent only in ancient Georgian).
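
    One possible shape for the slower, more general mechanism is a pair of
    exception tables that are consulted before the fast path. This is only
    a sketch: SPECIAL_UPPER, LANGUAGE_UPPER, fast_upper and to_upper are
    invented names, fast_upper is an ASCII-only stand-in for a real run
    table, and the language tags ("tr") are an assumption about how the
    caller would identify the language.

```python
# One-to-many mappings: the result is a string, not a single code point.
SPECIAL_UPPER = {
    0x00DF: "SS",  # ß -> SS
}

# Language-dependent mappings, keyed by (language tag, code point).
LANGUAGE_UPPER = {
    ("tr", 0x0069): "\u0130",  # Turkish: i -> capital I with dot above
}

def fast_upper(cp):
    """Stand-in for the fast run-table lookup (ASCII only in this sketch)."""
    if 0x61 <= cp <= 0x7A:
        return cp - 32
    return cp

def to_upper(text, lang=None):
    """Upper-case a string: exception tables first, fast path otherwise."""
    out = []
    for ch in text:
        cp = ord(ch)
        special = LANGUAGE_UPPER.get((lang, cp))
        if special is None:
            special = SPECIAL_UPPER.get(cp)
        if special is not None:
            out.append(special)
        else:
            out.append(chr(fast_upper(cp)))
    return "".join(out)
```

    Note that the result can be longer than the input, so the caller cannot
    convert in place in a fixed-size buffer.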

    > Is that the way to go about it?

    Well, it's your library, so that's up to you... :-)

    Another thing to keep in mind is that the Unicode data changes often, so it
    is better if your property databases are (a) files external to the program
    and (b) automatically generated from Unicode's text files.
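
    For point (b), a generator could look roughly like this. It assumes the
    semicolon-separated layout of UnicodeData.txt, where field 0 is the
    code point and field 12 is the simple uppercase mapping; the function
    names are hypothetical. Mappings whose offset does not fit in one
    signed byte are returned separately, since they would need the general
    mechanism.

```python
def parse_simple_upper(lines):
    """Read UnicodeData.txt-style lines; return {code point: simple uppercase}."""
    mapping = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 15 or not fields[12]:
            continue  # malformed line, or no simple uppercase mapping
        mapping[int(fields[0], 16)] = int(fields[12], 16)
    return mapping

def build_runs(mapping):
    """Compress a mapping into (first, length, delta) runs plus leftovers."""
    runs = []
    leftovers = {}
    for cp in sorted(mapping):
        delta = mapping[cp] - cp
        if not -128 <= delta <= 127:
            leftovers[cp] = mapping[cp]  # offset too large for 1 signed byte
            continue
        if runs:
            first, length, run_delta = runs[-1]
            # Extend the previous run if this code is adjacent, uses the same
            # constant, and the 1-byte length field still has room.
            if cp == first + length and delta == run_delta and length < 255:
                runs[-1] = (first, length + 1, run_delta)
                continue
        runs.append((cp, 1, delta))
    return runs, leftovers

# Tiny in-memory sample in the UnicodeData.txt layout.
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "0062;LATIN SMALL LETTER B;Ll;0;L;;;;;N;;;0042;;0042",
    "0063;LATIN SMALL LETTER C;Ll;0;L;;;;;N;;;0043;;0043",
    "00B5;MICRO SIGN;Ll;0;L;<compat> 03BC;;;;N;;;039C;;039C",
]
sample_runs, sample_leftovers = build_runs(parse_simple_upper(SAMPLE))
```

    In real use the lines would come from the UnicodeData.txt file of the
    Unicode version you target, and the output would be written to the
    external data file that the library loads.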

    _ Marco


