RE: Finite state machines? UTF8: toFold(), normalisation, etc

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Mon May 05 2003 - 12:28:10 EDT

Next message: Theodore H. Smith: "Re: Finite state machines? UTF8: toFold(), normalisation, etc"

Previous message: Marco Cimarosti: "RE: character "stories""
Maybe in reply to: Theodore H. Smith: "Finite state machines? UTF8: toFold(), normalisation, etc"
Next in thread: Theodore H. Smith: "Re: Finite state machines? UTF8: toFold(), normalisation, etc"
Reply: Theodore H. Smith: "Re: Finite state machines? UTF8: toFold(), normalisation, etc"

Theodore H. Smith wrote:
> For example, a good finite state machine for my use, could
> extract some kind of about ASCII from the Unicode databases, that
> to lowercase something, you add 32 but only if the char is from
> 65 to 97.

There are several other runs of Latin, Greek, Cyrillic and other alphabets
that can be upper/lower-cased adding a constant. You can encode each one of
these runs with just 5 bytes:

- Code of first character -- 3 bytes unsigned
(e.g., for ASCII upper-casing: 0, 0, 97);

- Length of run -- 1 byte unsigned
(e.g., for ASCII upper-casing: 26)

- Constant to add -- 1 byte signed
(e.g., for ASCII upper-casing: -32)
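
As a rough illustration, the run table described above could be sketched like this in Python. The names CaseRun, UPPER_RUNS, pack_run and to_upper are hypothetical, the three runs are just hand-picked examples, and the big-endian byte order of the packed form is my assumption; a real table would be generated from the Unicode data files.

```python
import struct
from typing import NamedTuple

class CaseRun(NamedTuple):
    first: int   # code point of the first character in the run (3 bytes unsigned)
    length: int  # number of consecutive code points in the run (1 byte unsigned)
    delta: int   # constant to add to get the other case (1 byte signed)

# Hand-picked example runs for upper-casing (illustrative, not complete).
UPPER_RUNS = [
    CaseRun(0x0061, 26, -32),  # ASCII a..z
    CaseRun(0x03B1, 17, -32),  # Greek alpha..rho (final sigma breaks the run)
    CaseRun(0x0430, 32, -32),  # Cyrillic a..ya
]

def pack_run(run):
    """Pack one run into 5 bytes: 3-byte code point, length, signed delta."""
    return run.first.to_bytes(3, "big") + struct.pack(">Bb", run.length, run.delta)

def to_upper(cp, runs=UPPER_RUNS):
    """Return the upper-case code point for cp, or cp itself if no run matches."""
    for run in runs:
        if run.first <= cp < run.first + run.length:
            return cp + run.delta
    return cp
```

A linear scan is enough for a sketch; with many runs, a binary search on the first code point would probably be the better choice.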

If your search key also includes a mask to test for odd/even codes, this can
also work for several runs in Latin Extended-A/B, Latin Extended Additional,
and Greek Extended, where each capital is followed by its minuscule (in this
case, the constant to add is always +1 or -1, so you just need a one-bit
flag for storing the sign).
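
A possible sketch of the odd/even variant, again in Python with hypothetical names (PairRun, PAIR_RUNS, to_lower_paired, to_upper_paired). Here the parity of the capitals is stored per run, since I am assuming that some runs put capitals on even code points and others on odd ones; the two runs shown are examples from Latin Extended-A.

```python
from typing import NamedTuple

class PairRun(NamedTuple):
    first: int         # first code point of the run
    length: int        # number of code points in the run
    upper_parity: int  # 0 if capitals are on even code points, 1 if on odd

# Two example runs from Latin Extended-A (illustrative, not complete).
PAIR_RUNS = [
    PairRun(0x0100, 48, 0),  # U+0100..U+012F: capital on even, minuscule on odd
    PairRun(0x0139, 16, 1),  # U+0139..U+0148: capital on odd, minuscule on even
]

def to_lower_paired(cp, runs=PAIR_RUNS):
    """Add 1 if cp is a capital inside an alternating run, else return cp."""
    for run in runs:
        if run.first <= cp < run.first + run.length:
            return cp + 1 if (cp & 1) == run.upper_parity else cp
    return cp

def to_upper_paired(cp, runs=PAIR_RUNS):
    """Subtract 1 if cp is a minuscule inside an alternating run, else return cp."""
    for run in runs:
        if run.first <= cp < run.first + run.length:
            return cp - 1 if (cp & 1) != run.upper_parity else cp
    return cp
```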

However, notice that not all case conversions can be done with this
technique, so you should plan for a slower but more general mechanism
for the edge cases. E.g.:

- a single character corresponds to a *string* of characters in the
other case
(e.g.: "ß" [U+00DF] -> "SS" [U+0053 U+0053]);

- the other-case equivalent depends on language
(e.g.: "i" [U+0069] -> "İ" [U+0130] in Turkish, but "I" [U+0049]
in other languages);

- the other-case equivalent depends on document's age
(e.g.: all Georgian lower-case letters: they have an upper-case
equivalent only in ancient Georgian).
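
One way the slower, more general mechanism might be layered on top of the fast path is sketched below. Everything here is hypothetical: the table names, the "tr" language tag, and fast_upper, which is only an ASCII stand-in for a real run-table lookup. The exception tables hold just the two examples from the list above.

```python
# One-to-many mappings: a single character becomes a string.
SPECIAL_UPPER = {0x00DF: "SS"}

# Language-dependent mappings, keyed by (language tag, code point).
LANGUAGE_UPPER = {("tr", 0x0069): "\u0130"}

def fast_upper(cp):
    """Stand-in for the run-table fast path; handles ASCII a..z only."""
    return cp - 32 if 0x61 <= cp <= 0x7A else cp

def upper_string(text, lang=""):
    """Upper-case text, checking the exception tables before the fast path."""
    out = []
    for ch in text:
        cp = ord(ch)
        if (lang, cp) in LANGUAGE_UPPER:
            out.append(LANGUAGE_UPPER[(lang, cp)])
        elif cp in SPECIAL_UPPER:
            out.append(SPECIAL_UPPER[cp])
        else:
            out.append(chr(fast_upper(cp)))
    return "".join(out)
```

Note that the result has to be built as a string rather than converted in place, because the output can be longer than the input.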

> Is that the way to go about it?

Well, it's your library, so that's up to you... :-)

Another thing to keep in mind is that the Unicode data changes often, so it is
better if your property databases are (a) files external to the program and
(b) automatically generated from Unicode's text files.
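
For instance, a generator for the upper-casing runs might look roughly like the Python sketch below. It assumes the semicolon-separated layout of UnicodeData.txt, where the first field is the code point and the 13th field (index 12) is the simple upper-case mapping; build_upper_runs and SAMPLE are hypothetical names, and a real generator would also have to handle deltas that do not fit in a signed byte.

```python
def build_upper_runs(lines):
    """Group consecutive code points sharing the same delta into (first, length, delta)."""
    runs = []
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 15 or not fields[12]:
            continue  # malformed line, or no simple upper-case mapping
        cp = int(fields[0], 16)
        delta = int(fields[12], 16) - cp
        last = runs[-1] if runs else None
        # Extend the previous run if contiguous, same delta, and length fits in a byte.
        if last and last[0] + last[1] == cp and last[2] == delta and last[1] < 255:
            last[1] += 1
        else:
            runs.append([cp, 1, delta])
    return [tuple(run) for run in runs]

# A few lines in UnicodeData.txt format, for demonstration only.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0062;LATIN SMALL LETTER B;Ll;0;L;;;;;N;;;0042;;0042
0063;LATIN SMALL LETTER C;Ll;0;L;;;;;N;;;0043;;0043
00FF;LATIN SMALL LETTER Y WITH DIAERESIS;Ll;0;L;0079 0308;;;;N;LATIN SMALL LETTER Y DIAERESIS;;0178;;0178
""".splitlines()

runs = build_upper_runs(SAMPLE)
```

In practice the lines would be read from the external data file, and the resulting runs written out to whatever table file the library loads at start-up.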

_ Marco

This archive was generated by hypermail 2.1.5 : Mon May 05 2003 - 13:04:13 EDT