Re: Finite state machines? UTF8: toFold(), normalisation, etc

From: Mark Davis (mark.davis@jtcsv.com)
Date: Tue May 06 2003 - 12:13:14 EDT

  • Next message: Markus Scherer: "Re: Finite state machines? UTF8: toFold(), normalisation, etc"

    A minor correction: a trie may be more than two stages.

    We also have an enhanced version (developed by Markus Scherer) called
    a folded trie, that we use in ICU. See
    http://oss.software.ibm.com/icu/docs/papers/foldedtrie_iuc21.ppt.

    These data structures (and others that we use in ICU) are flattened;
    that is, they are stored as a continuous block of memory, with offsets
    instead of pointers. That way, they can all be used in read-only
    memory-mapped files for fast initialization.

    Märk Davis
    ________
    mark.davis@jtcsv.com
    IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
    (408) 256-3148
    fax: (408) 256-0799

    ----- Original Message -----
    From: "Addison Phillips [wM]" <aphillips@webmethods.com>
    To: "Theodore H. Smith" <delete@elfdata.com>
    Cc: <unicode@unicode.org>
    Sent: Tuesday, May 06, 2003 08:00
    Subject: Re: Finite state machines? UTF8: toFold(), normalisation, etc

    > Hi,
    >
    > A "trie" is a two stage table.
    >
    > PPT is PowerPoint. I was under the impression that the presentation
    was
    > available in PDF or HTML format also, but I guess it isn't if you
    didn't
    > see it in that format.
    >
    > Here's a hint, though: if you search "Bits of Unicode" on Google,
    you
    > can view much of the paper in HTML format using the option Google
    > provides there.
    >
    > Best Regards,
    >
    > Addison
    >
    > Theodore H. Smith wrote:
    > > Hi Addison,
    > >
    > > Thanks a lot for the answers that may help me get a clean
    solution.
    > >
    > > I'm unfamiliar with "trie". What does it mean? If it's less
    complex than
    > > a finite state machine I'm sure that'll be a benefit for me.
    > >
    > > "Bits of Unicode" is in .ppt format. Is that "Power point"? I
    don't have
    > > powerpoint or an app to read .ppt.
    > >
    > > Thanks a lot for your kind help.
    > >
    > >> Hi Mr. Smith,
    > >>
    > >> I wrote about "compiling" the Unicode character data tables in
    my
    > >> response. That reply was somewhat sketchy: my three-year old son
    was
    > >> sitting in my lap waiting for his machine to boot while I wrote
    it...
    > >>
    > >> Mark Davis wrote more-or-less the canonical presentation on this
    > >> subject for an IUC conference a few years ago. The title was
    "Bits of
    > >> Unicode". It may be elsewhere, but I've always found it on his
    > >> personal page http://www.macchiato.com
    > >>
    > >> I have personally had reason to compile my own tables (NOT using
    a
    > >> finite state language, just tries and similar structures) for
    purposes
    > >> beyond those of ICU. But I must admit that in recent years I
    have
    > >> tended to extend ICU or the very similar code in the Java JDK
    instead
    > >> of implementing my own tables, but it isn't that hard to do.
    Getting
    > >> the edge cases and esoteric details right, though, make it not
    worth
    > >> my while (in my estimation).
    > >>
    > >> A finite state machine could certainly do "the job" (although
    what you
    > >> really have is a number of similar "jobs" to do), but trie
    tables and
    > >> similar structures are a lot easier to build and maintain and do
    the
    > >> job marvelously well.
    > >>
    > >> Good luck with your implementation.
    > >
    > >
    > >
    > > --
    > > Theodore H. Smith - Macintosh Consultant / Contractor.
    > > My website: <www.elfdata.com/>
    > >
    >
    >
    > --
    > Addison P. Phillips
    > Director, Globalization Architecture
    > webMethods, Inc.
    >
    > +1 408.962.5487 mailto:aphillips@webmethods.com
    > -------------------------------------------
    > Internationalization is an architecture. It is not a feature.
    >
    > Chair, W3C I18N WG Web Services Task Force
    > http://www.w3.org/International/ws
    >
    >
    >
    >
    >



    This archive was generated by hypermail 2.1.5 : Tue May 06 2003 - 13:14:59 EDT