Re: Finite state machines? UTF8: toFold(), normalisation, etc

From: Theodore H. Smith (delete@elfdata.com)
Date: Mon May 05 2003 - 12:54:47 EDT

Next message: Addison Phillips [wM]: "Re: Finite state machines? UTF8: toFold(), normalisation, etc"

Previous message: Marco Cimarosti: "RE: Finite state machines? UTF8: toFold(), normalisation, etc"
In reply to: Marco Cimarosti: "RE: Finite state machines? UTF8: toFold(), normalisation, etc"
Next in thread: Addison Phillips [wM]: "Re: Finite state machines? UTF8: toFold(), normalisation, etc"

Hi Marco,

Thanks a lot for the pointers.

> If your search key also includes a mask to test for odd/even codes, this
> can also work for several runs in Latin Extended-A/B, Latin Extended
> Additional, and Greek Extended, where each capital is followed by its
> minuscule (in this case, the constant to add is always +1 or -1, so you
> just need a one-bit flag for storing the sign).

That's a nice hint, thanks.
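
A minimal sketch of the odd/even test described above, limited to a few
runs in Latin Extended-A. The run table and the names PAIRED_RUNS,
to_lower and to_upper are illustrative assumptions, not code from any
existing library; a real table would be generated from the Unicode data
files rather than typed in by hand.

```python
# Runs where each capital is immediately followed by its minuscule.
# Each entry: (first, last, parity of the capital's code point).
# Hand-picked for illustration only.
PAIRED_RUNS = [
    (0x0100, 0x012F, 0),  # capitals sit on even code points
    (0x0132, 0x0137, 0),
    (0x0139, 0x0148, 1),  # capitals sit on odd code points
    (0x014A, 0x0177, 0),
    (0x0179, 0x017E, 1),
]


def _capital_parity(cp):
    """Return the capital's parity for the run containing cp, or None."""
    for first, last, parity in PAIRED_RUNS:
        if first <= cp <= last:
            return parity
    return None


def to_lower(cp):
    """Map a capital in a paired run to its minuscule (+1)."""
    parity = _capital_parity(cp)
    if parity is not None and (cp & 1) == parity:
        return cp + 1
    return cp


def to_upper(cp):
    """Map a minuscule in a paired run to its capital (-1)."""
    parity = _capital_parity(cp)
    if parity is not None and (cp & 1) != parity:
        return cp - 1
    return cp
```

Code points outside the listed runs come back unchanged, so something
like U+0130 or U+0178 would still need a separate mechanism.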

> However, notice that not all case conversions can be done with this
> technique, so you should think ahead for a slower but more general
> mechanism for edge cases. E.g.:
>
> - a single character corresponds to a *string* of characters in the
>   other case (e.g.: "ß" [U+00DF] -> "SS" [U+0053 U+0053]);

That and precomposed/decomposed mappings also need to be looked at, yes.
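
One possible shape for the slower, more general path: check a small
table of one-to-many mappings first, and only then take the fast
one-to-one route. SPECIAL_UPPER, simple_upper and upper_string are
hypothetical names, the table holds a single entry for illustration,
and simple_upper is an ASCII-only stand-in for whatever fast path is
actually used.

```python
# One-to-many uppercase mappings, keyed by code point.
# Only U+00DF is listed here; this is not the full set.
SPECIAL_UPPER = {
    0x00DF: (0x0053, 0x0053),  # "ß" -> "SS"
}


def simple_upper(cp):
    """Stand-in for the fast one-to-one path: ASCII letters only."""
    if 0x61 <= cp <= 0x7A:
        return cp - 0x20
    return cp


def upper_string(text):
    """Uppercase text, letting one character expand into several."""
    out = []
    for ch in text:
        cp = ord(ch)
        expansion = SPECIAL_UPPER.get(cp)
        if expansion is not None:
            out.extend(expansion)          # slow path: string result
        else:
            out.append(simple_upper(cp))   # fast path: single code point
    return "".join(map(chr, out))
```

The point of the sketch is only that the result has to be built as a
sequence, since the output can be longer than the input.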

> - the other-case equivalent depends on language
> (e.g.: "i" [U+0069] -> "İ" [U+0130] in Turkish, but "I" [U+0049]
> in other languages);

I'll be able to deal with that also.
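
A rough sketch of how the language could be passed in as a parameter.
The function name upper_for_language, the language tags and the set
DOTTED_I_LANGUAGES are assumptions made for the example; Python's
built-in str.upper() stands in for the language-independent mapping.

```python
# Languages assumed to distinguish dotted and dotless i.
DOTTED_I_LANGUAGES = {"tr", "az"}

# Language-specific overrides, applied before the default mapping.
_DOTTED_I_UPPER = str.maketrans({
    "i": "\u0130",   # i -> capital I with dot above
    "\u0131": "I",   # dotless i -> plain capital I
})


def upper_for_language(text, lang):
    """Uppercase text, applying dotted-i rules for the given language."""
    if lang in DOTTED_I_LANGUAGES:
        text = text.translate(_DOTTED_I_UPPER)
    return text.upper()
```

Keeping the overrides in a separate table means the default mapping
stays untouched for every other language.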

> - the other-case equivalent depends on the document's age
> (e.g.: all Georgian lower-case letters: they have an upper-case
> equivalent only in ancient Georgian).

That's quite an interesting distinction, although I don't think this
will affect my code.

>> Is that the way to go about it?
>
> Well, it's your library, so that's up to you... :-)

Well, perhaps there is a better way than my finite state machine idea.
Or perhaps that is "the way" it is generally done, because it's what
happens to work best with Unicode. If so, then perhaps there are some
general pointers on how to implement this.

> Another thing to keep in mind is that Unicode Data change often, so it
> is better if your property databases are (a) files external to the
> program and (b) automatically generated from Unicode's text files.

Yes, I have this in mind also. I'll do it in 3 stages:

1) Code to extract information into a compressed state loadable by my
   finite state machine
2) Code to load the compressed information into my finite state machine
3) The finite state machine

This way, it'll hopefully be a "write once, use forever" (or until
Unicode.org manages to add some character mapping that breaks my old
assumptions).
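
The three stages above could be sketched roughly as follows. This is
only an illustration under several assumptions: it reads lines in the
semicolon-separated UnicodeData.txt layout (field 0 is the code point,
field 13 the simple lowercase mapping), it uses a few sample lines
instead of the real file, and it substitutes a sorted run table with
binary search for the finite state machine. The names SAMPLE,
extract_lower_runs and LowerTable are made up for the example.

```python
import bisect

# A few lines in UnicodeData.txt layout, used in place of the real file.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
0043;LATIN CAPITAL LETTER C;Lu;0;L;;;;;N;;;;0063;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0100;LATIN CAPITAL LETTER A WITH MACRON;Lu;0;L;0041 0304;;;;N;LATIN CAPITAL LETTER A MACRON;;;0101;
"""


def extract_lower_runs(lines):
    """Stage 1: compress lowercase mappings into (start, end, delta) runs."""
    runs = []
    for line in lines:
        fields = line.split(";")
        if len(fields) < 15 or not fields[13]:
            continue  # malformed line, or no simple lowercase mapping
        cp = int(fields[0], 16)
        delta = int(fields[13], 16) - cp
        # Extend the previous run if this code point continues it.
        if runs and runs[-1][1] == cp - 1 and runs[-1][2] == delta:
            runs[-1][1] = cp
        else:
            runs.append([cp, cp, delta])
    return [tuple(run) for run in runs]


class LowerTable:
    """Stages 2 and 3: load the runs, then answer lookups."""

    def __init__(self, runs):
        self.runs = sorted(runs)
        self.starts = [run[0] for run in self.runs]

    def lower(self, cp):
        # Find the last run starting at or before cp.
        i = bisect.bisect_right(self.starts, cp) - 1
        if i >= 0:
            start, end, delta = self.runs[i]
            if cp <= end:
                return cp + delta
        return cp


runs = extract_lower_runs(SAMPLE.splitlines())
table = LowerTable(runs)
```

Since stage 1 is the only part that reads the data file, a new Unicode
release should only require rerunning it, as long as the new data
still fits the run format.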

--
     Theodore H. Smith - Macintosh Consultant / Contractor.
     My website: <www.elfdata.com/>

This archive was generated by hypermail 2.1.5 : Mon May 05 2003 - 13:42:25 EDT