Re: algorithm implementation: coping with code properties

From: Mark Davis ☕ (mark@macchiato.com)
Date: Thu Jan 28 2010 - 11:52:13 CST

  • Next message: spir: "Re: algorithm implementation: coping with code properties"

    ICU has had to develop compact-but-fast data structures for many issues
    dealing with Unicode. If it is possible for you to use ICU, it would save
    reinventing the (very big) wheel. If not, you might look at ICU's
    implementation (it's non-viral open-source) as a reference.

    Also see my "Bits of Unicode" on
    http://www.macchiato.com/unicode/other-slides, discussing data structures.

    Mark

    On Thu, Jan 28, 2010 at 03:45, spir <denis.spir@free.fr> wrote:

    > Hello,
    >
    >
    > It seems code property files typically look like (example from
    > GraphemeBreakProperty.txt):
    >
    > ================================
    > ...
    > 007F..009F ; Control # Cc [33] <control-007F>..<control-009F>
    > 00AD ; Control # Cf SOFT HYPHEN
    > 0600..0603 ; Control # Cf [4] ARABIC NUMBER SIGN..ARABIC SIGN SAFHA
    > 06DD ; Control # Cf ARABIC END OF AYAH
    > 070F ; Control # Cf SYRIAC ABBREVIATION MARK
    > ...
    > 0591..05BD ; Extend # Mn [45] HEBREW ACCENT ETNAHTA..HEBREW POINT METEG
    > 05BF ; Extend # Mn HEBREW POINT RAFE
    > ...
    > ================================
    >
    > Meaning each code or range is individually assigned a property. I imagine 2
    > ways to cope with such data in an application.
    >
    >
    > -1- using the list as is -------
    >
    > Parse the list as is into a collection (eg associative array) that maps
    > each code or range to its property value. Concretely, there can also be two
    > separate collections for individual codes and ranges, but the principle
    > remains. (Note: Even for individual codes, the collection cannot be a list,
    > indeed, for there may be huge holes between code values).
    >
    > Let's say property values can be a,b,c. The algorithm's main loop content
    > may look like:
    >
    > <get p from c>
    > if p == a then ...
    > elseif p == b then ...
    > ...
    > else ...
    >
    > (Actually, algorithms often require testing properties of a pair of
    > successive codes in string, but this does not change the overall scheme.)
    >
    > The issue is to get p when c is not indvidually listed; this may be the
    > common case, since ranges usually cover a greater number of code values. We
    > then need to iterate over a *global* list of ranges, checking whether any of
    > them holds c. (The code's property value is always searched before the
    > series of tests.)
    >
    >
    > -2- constructing categories -------
    >
    > An alternative is to create categories (see below *) A,B,C holding the
    > codes or ranges which property is a,b,c, respectively.
    > Then, the algorithm rather looks like (OO-like pseudo-code):
    >
    > if A.holds(c) then ...
    > elseif B.holds(c) then ...
    > ...
    > else ...
    >
    > Again, if ever c is not individually listed, we need to iterate over
    > ranges. But each category only holds a subset of the global list of ranges,
    > and the search will stop as soon as one matches. (Note the property is not
    > searched _before_ the series of tests.)
    >
    > Anyway, aside performance, I find this solution cleaner --for any reason.
    >
    > A small issue is it first requires grouping the source data under titles,
    > eg:
    > ================================
    > Control:
    > ...
    > 007F..009F
    > 00AD
    > 0600..0603
    > 06DD
    > 070F
    > ...
    > Extend:
    > ...
    > 0591..05BD
    > 05BF
    > ...
    > ================================
    > Or even more directly something like:
    > ================================
    > ControlCodes = (..., (007F..009F), 00AD, (0600..0603), 06DD, 070F, ...)
    > ExtendCodes = (..., (0591..05BD), 05BF, ...)
    > ================================
    > But this needs to be done only once per data file and, indeed, the same
    > data parsing/rewriting tool func can apply to all files.
    >
    >
    > What do you think?
    >
    >
    > Denis
    >
    >
    > (*) Category
    > A category conceptually is a list of ranges, plus a set of individual
    > codes. Testing whether a code belongs to it is simply checking the set, then
    > each range.
    > A type can be built for this kind of object, or simply a factory func (that
    > builds it from a list of codes and ranges) and an associated "holds" func.
    > Example code available in Lua and Python.
    > ________________________________
    >
    > la vita e estrany
    >
    > http://spir.wikidot.com/
    >
    >



    This archive was generated by hypermail 2.1.5 : Thu Jan 28 2010 - 11:55:30 CST