Re: Finite state machines? UTF8: toFold(), normalisation, etc

From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Mon May 05 2003 - 14:06:42 EDT

Next message: Kenneth Whistler: "RE: character "stories""

Previous message: Theodore H. Smith: "Re: Finite state machines? UTF8: toFold(), normalisation, etc"
In reply to: Theodore H. Smith: "Finite state machines? UTF8: toFold(), normalisation, etc"
Next in thread: Theodore H. Smith: "Re: Finite state machines? UTF8: toFold(), normalisation, etc"
Reply: Theodore H. Smith: "Re: Finite state machines? UTF8: toFold(), normalisation, etc"

Hi Mr. Smith,

I wrote about "compiling" the Unicode character data tables in my
response. That reply was somewhat sketchy: my three-year-old son was
sitting in my lap waiting for his machine to boot while I wrote it...

Mark Davis wrote more-or-less the canonical presentation on this subject
for an IUC conference a few years ago. The title was "Bits of Unicode".
It may be elsewhere, but I've always found it on his personal page
http://www.macchiato.com

I have personally had reason to compile my own tables (NOT using a
finite state language, just tries and similar structures) for purposes
beyond those of ICU. I must admit, though, that in recent years I have
tended to extend ICU or the very similar code in the Java JDK instead of
implementing my own tables. It isn't that hard to do, but getting the
edge cases and esoteric details right makes it not worth my while (in
my estimation).
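
To make "compiling" a little more concrete, here is a minimal sketch in
Python. It is not how ICU or the JDK do it. It assumes input in the
UnicodeData.txt layout (semicolon-separated, field 0 the code point,
field 13 the simple lowercase mapping) sorted by code point, and it
only handles simple one-to-one lowercase mappings. The names SAMPLE,
compile_lower_ranges and to_lower are made up for the illustration, and
SAMPLE is just a handful of lines, not the real file.

```python
import bisect

# A few lines in UnicodeData.txt format (illustrative subset only).
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
0043;LATIN CAPITAL LETTER C;Lu;0;L;;;;;N;;;;0063;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0391;GREEK CAPITAL LETTER ALPHA;Lu;0;L;;;;;N;;;;03B1;
0392;GREEK CAPITAL LETTER BETA;Lu;0;L;;;;;N;;;;03B2;
"""

def compile_lower_ranges(text):
    """Collapse per-character lowercase mappings into (start, end, delta) runs."""
    ranges = []
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) < 15 or not fields[13]:
            continue  # no simple lowercase mapping on this line
        cp = int(fields[0], 16)
        delta = int(fields[13], 16) - cp
        # Extend the previous run if this code point is adjacent to it
        # and maps with the same offset; otherwise start a new run.
        if ranges and ranges[-1][1] == cp - 1 and ranges[-1][2] == delta:
            ranges[-1] = (ranges[-1][0], cp, delta)
        else:
            ranges.append((cp, cp, delta))
    return ranges

RANGES = compile_lower_ranges(SAMPLE)
STARTS = [r[0] for r in RANGES]

def to_lower(cp):
    """Binary-search the compiled runs; unmapped code points come back unchanged."""
    i = bisect.bisect_right(STARTS, cp) - 1
    if i >= 0:
        start, end, delta = RANGES[i]
        if cp <= end:
            return cp + delta
    return cp
```

On this sample the six input lines compile down to two runs, which is
the kind of "pattern" you were hoping to extract. The edge cases I
mentioned (one-to-many mappings, context-sensitive and language-specific
rules) are exactly what a sketch like this leaves out.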

A finite state machine could certainly do "the job" (although what you
really have is a number of similar "jobs" to do), but trie tables and
similar structures are a lot easier to build and maintain and do the job
marvelously well.
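
As a rough illustration of what I mean by a trie table, here is a
two-stage lookup table sketched in Python. This is my own simplified
example, not the layout ICU actually uses. The names build_two_stage,
lookup and DELTAS are hypothetical, the block size of 128 is an
arbitrary choice, and the data is a small hand-entered sample (ASCII
A-Z and Cyrillic U+0410..U+042F, each of which lowercases by adding 32),
not the full character database.

```python
def build_two_stage(deltas, block_bits=7, limit=0x110000):
    """Split the code space into fixed-size blocks and store each distinct block once.

    stage1 maps a block number to an index into stage2;
    stage2 holds the unique blocks of per-code-point offsets.
    """
    block_size = 1 << block_bits
    stage1, stage2, seen = [], [], {}
    for base in range(0, limit, block_size):
        block = tuple(deltas.get(cp, 0) for cp in range(base, base + block_size))
        if block not in seen:
            seen[block] = len(stage2)
            stage2.append(block)
        stage1.append(seen[block])
    return stage1, stage2

def lookup(cp, stage1, stage2, block_bits=7):
    """Two array reads: high bits select the block, low bits select the entry."""
    block = stage2[stage1[cp >> block_bits]]
    return cp + block[cp & ((1 << block_bits) - 1)]

# Sample data only: offset to add to reach the lowercase form.
DELTAS = {cp: 32 for cp in range(0x41, 0x5B)}
DELTAS.update({cp: 32 for cp in range(0x410, 0x430)})

STAGE1, STAGE2 = build_two_stage(DELTAS)
```

With this sample, STAGE1 has one small entry per 128 code points and
STAGE2 holds only three distinct blocks, because every block with no
mappings shares a single all-zero block. A lookup costs two array
indexes and no branching on ranges, which is why these tables are easy
to build and fast to use.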

Good luck with your implementation.

Best Regards,

Addison

Theodore H. Smith wrote:
>
> Hi people,
>
> thanks for the answers. I won't be using ICU at all. I am writing my OWN
> UTF8 processing library you see, not writing something that uses it.
> Different ballgame.
>
> I find the ICU source very difficult to even get started with. Firstly,
> there is so much of it! I do dislike overly complex systems, preferring
> to go for the streamlined ultra-tight approach. I'll see what I can
> learn from ICU, though, perhaps I'll come across the files for toLower
> or toFold, etc.
>
> I'm wondering if the MacOS builds some kind of table? Someone else
> mentioned that the way was to extract the information from the Unicode
> database, into a more compact form. Would that be writing some kind of
> finite state machine that analyses the data for a very compact
> representation?
>
> For example, a good finite state machine for my use could extract some
> kind of rule about ASCII from the Unicode databases: to lowercase
> something, you add 32, but only if the char is from 65 to 90. So that
> would be really just two "if tests" for all of ASCII, and perhaps it
> could find some other patterns in other languages... Or, if not, store
> it in a compact form.
>
> Is that the way to go about it?
>
>> Dear Mr. Smith,
>>
>> That's a lot of different things, some of which are not entirely based
>> on Unicode properties. Collation, for example, is strongly affected by
>> language.
>>
>> Unicode character properties provide the information you need to
>> implement many of these functions. The Unicode character data files
>> have fields that you can compile into data tables for this purpose.
>>
>> Before you go off and do that, you should take a look at libraries
>> that have already done it. I recommend a close look at the ICU library
>> (http://oss.software.ibm.com/icu) as an excellent starting point.
>>
>> You should also look closely at the FAQ and technical documentation on
>> the Unicode website, if you have not already.
>>
>> I should note that very few applications work directly on UTF-8 byte
>> sequences. Most choose to process Unicode using UTF-16 or UTF-32 in
>> memory, even if the ultimate representation is UTF-8.
>>
>> I hope this helps for starters.
>>
>> Best Regards,
>>
>> Addison
>>
>>
>>
>> --
>> Addison P. Phillips
>> Director, Globalization Architecture
>> webMethods, Inc.
>>
>> +1 408.962.5487 mailto:aphillips@webmethods.com
>> -------------------------------------------
>> Internationalization is an architecture. It is not a feature.
>>
>> Chair, W3C I18N WG Web Services Task Force
>> http://www.w3.org/International/ws
>>
>>
>> Theodore H. Smith wrote:
>>
>>> Hi list,
>>> I need some way to implement toUpper(), toFold(),
>>> normalisation, collation, and perhaps other Unicode features I may
>>> have missed out, on UTF8 strings stored in RAM.
>>> I need to implement it for Windows (32-bit), MacOS9 and MacOSX.
>>> I have other Unicode processing code, already, but not these or
>>> anything close to these.
>>> I heard that the only way is to read out the character information
>>> from a database? My whole string processing library, with hundreds of
>>> functions and a few properties, is only 54k. I don't want to add 200k
>>> of database reading code and then huge Unicode database files to this
>>> 54k.
>>> How is this best done, then? I'm assuming there isn't any
>>> mathematical way to figure out a codepoint's properties? So where do
>>> I get this data and what's the fastest way to do it?
>>> --
>>> Theodore H. Smith - Macintosh Consultant / Contractor.
>>> My website: <www.elfdata.com/>
>>
>>
>>
>>
>>
> --
> Theodore H. Smith - Macintosh Consultant / Contractor.
> My website: <www.elfdata.com/>
>
>

-- 
Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
+1 408.962.5487  mailto:aphillips@webmethods.com
-------------------------------------------
Internationalization is an architecture. It is not a feature.
Chair, W3C I18N WG Web Services Task Force
http://www.w3.org/International/ws


This archive was generated by hypermail 2.1.5 : Mon May 05 2003 - 15:06:38 EDT