Re: Finite state machines? UTF8: toFold(), normalisation, etc

From: Theodore H. Smith (delete@elfdata.com)
Date: Mon May 05 2003 - 12:54:47 EDT

    Hi Marco,

    thanks a lot for the pointers.

    > If your search key also includes a mask to test for odd/even codes,
    > this can also work for several runs in Latin Extended-A/B, Latin
    > Extended Additional, and Greek Extended, where each capital is
    > followed by its minuscule (in this case, the constant to add is
    > always +1 or -1, so you just need a one-bit flag for storing the
    > sign).

    That's a nice hint, thanks.
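
    As a rough illustration of the quoted hint, here is a minimal Python
    sketch. The table layout and the names RUNS, simple_lower and
    simple_upper are assumptions made for this example, not part of any
    existing library, and only three runs from Latin Extended-A are
    listed. Each run stores one bit: whether capitals sit on even or odd
    code points.

```python
# Hypothetical run table: (first, last, parity of the capital's code point).
# Only a few Latin Extended-A runs are listed, purely for illustration.
RUNS = [
    (0x0100, 0x012F, 0),  # capitals on even code points
    (0x0139, 0x0148, 1),  # capitals on odd code points
    (0x014A, 0x0177, 0),  # capitals on even code points
]

def _find_run(cp):
    """Return the capital parity of the run containing cp, or None."""
    for first, last, cap_parity in RUNS:
        if first <= cp <= last:
            return cap_parity
    return None

def simple_lower(cp):
    """Map a capital to the minuscule that follows it (+1)."""
    cap_parity = _find_run(cp)
    if cap_parity is not None and (cp & 1) == cap_parity:
        return cp + 1
    return cp

def simple_upper(cp):
    """Map a minuscule to the capital that precedes it (-1)."""
    cap_parity = _find_run(cp)
    if cap_parity is not None and (cp & 1) != cap_parity:
        return cp - 1
    return cp
```

    Code points outside the listed runs are returned unchanged, so a
    real table would still need the other mechanisms discussed below.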

    > However, notice that not all case conversions can be done with this
    > technique, so you should think ahead for a slower but more general
    > mechanism for edge cases. E.g.:
    >
    > - a single character corresponds to a *string* of characters in the
    >   other case (e.g.: "ß" [U+00DF] -> "SS" [U+0053 U+0053]);

    That and precomposed/decomposed mappings also need to be looked at, yes.
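
    One way to picture the "slower but more general mechanism" is an
    exception table that is consulted before the fast per-character
    mapping. The sketch below is only an illustration: SPECIAL_UPPER,
    simple_upper and upper_string are hypothetical names, the exception
    table holds just the one entry from the example above, and the fast
    path is an ASCII-only stand-in.

```python
# Hypothetical exception table: one code point maps to several code points.
SPECIAL_UPPER = {
    0x00DF: (0x0053, 0x0053),  # "ß" -> "SS"
}

def simple_upper(cp):
    """Stand-in for the fast one-to-one table (ASCII only in this sketch)."""
    if 0x61 <= cp <= 0x7A:
        return cp - 0x20
    return cp

def upper_string(text):
    """Uppercase text, checking the exception table before the fast path."""
    out = []
    for ch in text:
        cp = ord(ch)
        special = SPECIAL_UPPER.get(cp)
        if special is not None:
            out.extend(special)           # one-to-many mapping
        else:
            out.append(simple_upper(cp))  # one-to-one mapping
    return "".join(chr(cp) for cp in out)
```

    The result can be longer than the input, so the caller cannot
    convert in place in a fixed-size buffer.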

    > - the other-case equivalent depends on language
    >   (e.g.: "i" [U+0069] -> "İ" [U+0130] in Turkish, but "I" [U+0049]
    >   in other languages);

    I'll be able to deal with that also.
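
    A language-dependent mapping can be sketched as a small per-language
    override table that is checked before the default mapping. The names
    TAILORED_UPPER, default_upper and upper_cp, and the use of the
    language tag "tr", are assumptions for this example; the default
    mapping is an ASCII-only stand-in.

```python
# Hypothetical per-language overrides, keyed by a language tag.
TAILORED_UPPER = {
    "tr": {0x0069: 0x0130},  # Turkish: "i" -> "İ"
}

def default_upper(cp):
    """Stand-in for the language-neutral table (ASCII only in this sketch)."""
    return cp - 0x20 if 0x61 <= cp <= 0x7A else cp

def upper_cp(cp, lang=None):
    """Uppercase one code point, applying language overrides first."""
    overrides = TAILORED_UPPER.get(lang, {})
    return overrides.get(cp, default_upper(cp))
```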

    > - the other-case equivalent depends on the document's age
    >   (e.g.: all Georgian lower-case letters: they have an upper-case
    >   equivalent only in ancient Georgian).

    That's quite an interesting distinction, although I don't think this
    will affect my code.

    >> Is that the way to go about it?
    >
    > Well, it's your library, so that's up to you... :-)

    Well, perhaps there is a better way than my finite state machine idea.
    Or perhaps that is how it is generally done, because it happens to
    work best with Unicode. If so, perhaps there are some general pointers
    on how to implement it.

    > Another thing to keep in mind is that Unicode Data change often, so
    > it is better if your property databases are (a) files external to
    > the program and (b) automatically generated from Unicode's text
    > files.

    Yes, I have this in mind as well. I'll do it in three stages:

    1) Code to extract information into a compressed form that my finite
    state machine can load
    2) Code to load the compressed information into my finite state machine
    3) The finite state machine

    This way, it'll hopefully be a "write once, use forever" (or until
    Unicode.org manages to add some character mapping that breaks my old
    assumptions).
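
    The three stages can be sketched roughly as follows. This is an
    illustration under stated assumptions, not the actual design: the
    function names and the "first last delta" text format are made up
    for the example, the input is a handful of lines in the
    semicolon-separated layout of UnicodeData.txt (field 0 is the code
    point, field 13 the simple lowercase mapping), and stage 3 uses a
    binary search over runs in place of a finite state machine.

```python
import bisect

# A few sample lines in the UnicodeData.txt field layout.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
0043;LATIN CAPITAL LETTER C;Lu;0;L;;;;;N;;;;0063;
0391;GREEK CAPITAL LETTER ALPHA;Lu;0;L;;;;;N;;;;03B1;
0392;GREEK CAPITAL LETTER BETA;Lu;0;L;;;;;N;;;;03B2;
"""

def extract(lines):
    """Stage 1: compress lowercase mappings into (first, last, delta) runs."""
    runs = []
    for line in lines:
        fields = line.split(";")
        if len(fields) < 14 or not fields[13]:
            continue  # no simple lowercase mapping on this line
        cp = int(fields[0], 16)
        delta = int(fields[13], 16) - cp
        if runs and runs[-1][1] == cp - 1 and runs[-1][2] == delta:
            runs[-1][1] = cp  # extend the current run
        else:
            runs.append([cp, cp, delta])
    return [tuple(run) for run in runs]

def dump(runs):
    """Stage 1 output: one 'first last delta' line per run (made-up format)."""
    return "\n".join("%X %X %d" % run for run in runs)

def load(text):
    """Stage 2: read the compressed text back into a list of runs."""
    runs = []
    for line in text.splitlines():
        first, last, delta = line.split()
        runs.append((int(first, 16), int(last, 16), int(delta)))
    return runs

def to_lower(runs, cp):
    """Stage 3: look up cp in the sorted runs (binary search, not an FSM)."""
    firsts = [first for first, _, _ in runs]
    i = bisect.bisect_right(firsts, cp) - 1
    if i >= 0:
        first, last, delta = runs[i]
        if cp <= last:
            return cp + delta
    return cp
```

    With this split, a new Unicode release only means rerunning stage 1
    on the new data file, provided the new mappings still fit the run
    format.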

    --
         Theodore H. Smith - Macintosh Consultant / Contractor.
         My website: <www.elfdata.com/>
    

