Re: If only MS Word was coded this well

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Wed Dec 08 2004 - 13:26:49 CST

    "Theodore H. Smith" <delete@elfdata.com> writes:

    >>> It's because code points have variable lengths in bytes, so
    >>> extracting individual characters is almost meaningless
    >
    > Same with UTF-16 and UTF-32. A character is multiple code-points,
    > remember? (decomposed chars?)

    > Nope. I've done tons of UTF-8 string processing. I've even done a case
    > insensitive word-frequency measuring algorithm on UTF-8. It runs
    > blastingly fast, because I can do the processing with bytes.

    Ah, so first you say that "a character" means "a base code point plus a
    number of combining code points", and then you admit that your program
    actually processes strings in terms of even lower-level units: bytes of
    the UTF-8 encoding?

    Why don't you treat a string as a sequence of "base code point with
    combining code points" items?

    Answer: because often this grouping is irrelevant, like in your
    example of word statistics. Code point grouping is more important:
    Unicode algorithms are typically described in terms of code points.
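
    The three levels of grouping discussed above can be compared with a
    minimal Python sketch (Python is an assumption here, not the language
    either program was written in). The helper name clusters is
    hypothetical, and attaching combining marks to the preceding base code
    point is only a rough approximation of what Unicode calls a grapheme
    cluster.

```python
import unicodedata

def clusters(text):
    """Group each base code point with the combining code points after it.

    Simplification: only the canonical combining class is consulted; this is
    not the full grapheme cluster algorithm from the Unicode standard.
    """
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch       # attach the combining mark to its base
        else:
            groups.append(ch)      # a base code point starts a new group
    return groups

s = "e\u0301te\u0301"              # "ete" with two decomposed acute accents

utf8_units = len(s.encode("utf-8"))   # count of UTF-8 bytes
code_points = len(s)                  # count of Unicode code points
groups = clusters(s)                  # base + combining groups
```

    The same string has a different length at each level, which is why an
    algorithm has to state which unit it is defined over.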

    > It just requires you to understand the actual logic of UTF-8 well
    > enough to know that you can treat it as bytes, most of the time.

    When I implemented the word boundary algorithm from Unicode, I was
    glad that I could do it in terms of UTF-32 and ISO-8859-1 instead of
    UTF-8, even though I do understand the logic of UTF-8.

    > As for isspace... sure there is a UTF-8 non-byte space.

    I don't understand.

    If a string is exposed as a sequence of UTF-8 units, it makes no sense
    to ask whether a particular unit isspace. And it makes no sense to ask
    this about a whole string either. It would have to be a function which
    works in terms of some iterator over strings.
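
    A small Python sketch of this point, using EM SPACE (U+2003) as an
    arbitrary example of a space that takes more than one byte in UTF-8.
    The variable names are illustrative only.

```python
text = "a\u2003b"                  # 'a', EM SPACE (U+2003), 'b'
data = text.encode("utf-8")        # five bytes: 61 E2 80 83 62

# Asking the question of each UTF-8 unit in isolation: none of the three
# bytes that encode the EM SPACE is a space on its own.
unit_answers = [bytes([b]).isspace() for b in data]

# Asking it of each code point, obtained by iterating over the decoded string.
code_point_answers = [ch.isspace() for ch in data.decode("utf-8")]
```

    Only the second list contains a True, for the middle element; the
    per-unit test cannot see the space at all.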

    Well, some things do work in terms of positions inside strings, for
    example word boundaries. But people are used to thinking of isspace as
    a property of a *character*, whatever exactly the language means by
    that term. In my language a character is a Unicode code point, which
    keeps the concept of a string, as seen by the language, simple.

    > My case insensitive utf-8 word frequency counter (which runs
    > blastingly fast) however didn't find this to be any problem. It
    > dealt with non-single byte all sorts of word breaks :o)
    >
    > It appears to run at about 3MB/second on my laptop, which involves
    > for every word, doing a word check on the entire previous collection
    > of words.

    I happen to have written a case insensitive word frequency counter as
    an example in my language, to test some Unicode algorithms.
    It uses the word boundary algorithm to specify words; a segment
    between boundaries must include a character of class L* or N* in order
    to be counted as a word. It maintains subcounts of case-sensitive
    forms of a case-insensitive word (implemented as a hash table of hash
    tables of integers). It converts input using iconv(), i.e. from an
    arbitrary locale encoding supported by the system.
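
    The structure described above can be sketched in Python; this is not
    the WordStat.ko program itself. Two assumptions are made loudly: the
    input is already decoded to Unicode strings (no iconv() step), and the
    Unicode word boundary algorithm is replaced by a much cruder stand-in
    that splits wherever a run of letters, numbers and marks meets anything
    else, so "don't" comes out as three segments. All function names are
    hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict
from itertools import groupby

def is_letter_or_number(ch):
    # General categories L* (letters) and N* (numbers).
    return unicodedata.category(ch)[0] in "LN"

def is_word_material(ch):
    # Marks (M*) are kept with letters so combining accents stay attached.
    return unicodedata.category(ch)[0] in "LNM"

def segments(line):
    # Crude stand-in for the Unicode word boundary algorithm: a boundary is
    # placed wherever is_word_material() changes between adjacent code points.
    return ["".join(run) for _, run in groupby(line, key=is_word_material)]

def word_stats(lines):
    # Hash table of hash tables of integers:
    # case-folded word -> exact spelling -> count.
    stats = defaultdict(Counter)
    for line in lines:
        for segment in segments(line):
            # A segment counts as a word only if it has a letter or number.
            if any(is_letter_or_number(ch) for ch in segment):
                stats[segment.casefold()][segment] += 1
    return stats

def most_common(stats, n=20):
    # Total count per case-insensitive word, highest first.
    totals = Counter({w: sum(forms.values()) for w, forms in stats.items()})
    return totals.most_common(n)

stats = word_stats(["The cat saw the Cat.", "THE end"])
top = most_common(stats, 1)
```

    Here stats["the"] holds separate subcounts for "The", "the" and "THE",
    while most_common() reports the case-insensitive total.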

    It was not written with speed in mind. It has 24 lines, 10 of which
    format the output (statistics about the 20 most common words).
    http://cvs.sourceforge.net/viewcvs.py/kokogut/kokogut/tests/WordStat.ko?view=markup

    It's written in a dynamically typed language, with dynamic dispatches
    and higher order functions everywhere, where all values except small
    integers are pointers, with immutable strings. Each line is divided
    into words separately; even a run of spaces is materialized as a
    string object before the program checks that it contains no letters
    or numbers and is therefore not a word.

    It processed 4.8MB in 3.2s (about 1.5MB/s) on my machine (Athlon 2000,
    1.25GHz), which I think is good enough under these conditions. This
    input happens to be ASCII (a mailbox), but the program did not know
    beforehand that it was ASCII.

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

