From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Wed Dec 08 2004 - 13:26:49 CST
"Theodore H. Smith" <delete@elfdata.com> writes:
>>> It's because code points have variable lengths in bytes, so
>>> extracting individual characters is almost meaningless
>
> Same with UTF-16 and UTF-32. A character is multiple code-points,
> remember? (decomposed chars?)
> Nope. I've done tons of UTF-8 string processing. I've even done a case
> insensitive word-frequency measuring algorithm on UTF-8. It runs
> blastingly fast, because I can do the processing with bytes.
Ah, so first you say that "a character" means "a base code point plus a
number of combining code points", and then you admit that your program
actually processes strings in terms of even lower-level units: the
bytes of the UTF-8 encoding?
Why don't you treat a string as a sequence of "base code point with
combining code points" items?
Answer: because this grouping is often irrelevant, as in your
example of word statistics. Grouping into code points matters more:
Unicode algorithms are typically described in terms of code points.
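To make the distinction concrete (my own Python sketch, not anything
from the original programs; Python strings are sequences of code
points):

```python
import unicodedata

# A decomposed accented letter is ONE user-perceived character but
# TWO code points: a base letter plus a combining accent.
s = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
print(len(s))                                        # 2 code points

# NFC normalization recombines them into the single code point U+00E9,
# so the same "character" can be one code point or two.
print(unicodedata.normalize("NFC", s) == "\u00e9")   # True
```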
> It just requires you to understand the actual logic of UTF-8 well
> enough to know that you can treat it as bytes, most of the time.
When I implemented the word boundary algorithm from Unicode, I was
glad that I could do it in terms of UTF-32 and ISO-8859-1 instead of
UTF-8, even though I do understand the logic of UTF-8.
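A small Python illustration (mine, not from either poster's code) of
why classifying characters byte by byte breaks down on multibyte UTF-8
sequences, which is what makes code-point-level processing more
comfortable here:

```python
text = "żółw"                 # a Polish word; "ż" is U+017C
utf8 = text.encode("utf-8")   # "ż" encodes as the two bytes 0xC5 0xBC

# A lone byte of a multibyte sequence cannot be classified as a letter:
print(bytes([utf8[0]]).isalpha())   # False -- 0xC5 alone means nothing

# The code point, on the other hand, classifies correctly:
print(text[0].isalpha())            # True
```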
> As for isspace... sure there is a UTF-8 non-byte space.
I don't understand.
If a string is exposed as a sequence of UTF-8 units, it makes no sense
to ask whether a particular unit satisfies isspace, and it makes no
sense to ask this about a whole string either. It would have to be a
function that works in terms of some iterator over the string.
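To illustrate the point (a hedged Python sketch, not code from the
post): applying isspace per code point via iteration gives a sensible
answer where per-byte checks on the UTF-8 units cannot:

```python
s = "a\u00a0b"               # NO-BREAK SPACE (U+00A0) between letters
utf8 = s.encode("utf-8")     # b'a\xc2\xa0b' -- the space is two bytes

# Neither byte of the two-byte NO-BREAK SPACE looks like a space:
print(any(bytes([b]).isspace() for b in utf8))   # False

# Iterating over code points, the space is found:
print(any(c.isspace() for c in s))               # True
```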
Well, some things do work in terms of positions inside strings, for
example word boundaries. But people are used to thinking of isspace as
a property of a *character*, whatever the language means by that
concept. In my language it means a Unicode code point, for the
conceptual simplicity of strings as the language sees them.
> My case insensitive utf-8 word frequency counter (which runs
> blastingly fast) however didn't find this to be any problem. It
> dealt with non-single byte all sorts of word breaks :o)
>
> It appears to run at about 3MB/second on my laptop, which involves
> for every word, doing a word check on the entire previous collection
> of words.
I happen to have written a case insensitive word frequency counter as
an example in my language, to test some Unicode algorithms.
It uses the word boundary algorithm to specify words; a segment
between boundaries must include a character of class L* or N* in order
to be counted as a word. It maintains subcounts of case-sensitive
forms of a case-insensitive word (implemented as a hash table of hash
tables of integers). It converts input using iconv(), i.e. from an
arbitrary locale encoding supported by the system.
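A rough Python analogue of that structure (my own sketch, not the
Kokogut program linked below: it splits on whitespace with a regex
instead of the UAX #29 word-boundary algorithm, and reads text that is
already decoded instead of going through iconv()):

```python
import re
import unicodedata
from collections import Counter, defaultdict

def word_stats(text):
    # Subcounts of case-sensitive forms under each case-insensitive
    # key: a hash table (dict) of hash tables (Counter) of integers.
    counts = defaultdict(Counter)
    for token in re.findall(r"\S+", text):
        word = token.strip(".,;:!?\"'()[]")
        # A segment counts as a word only if it contains a character
        # of general category L* (letter) or N* (number).
        if any(unicodedata.category(c)[0] in "LN" for c in word):
            counts[word.casefold()][word] += 1
    return counts

stats = word_stats("The cat saw the Cat. THE end!")
print(sum(stats["the"].values()))   # 3: "The", "the", "THE"
```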
It was not written with speed in mind. It has 24 lines, 10 of which
are formatting the output (statistics about 20 most common words).
http://cvs.sourceforge.net/viewcvs.py/kokogut/kokogut/tests/WordStat.ko?view=markup
It's written in a dynamically typed language, with dynamic dispatch
and higher-order functions everywhere, where all values except small
integers are pointers, and where strings are immutable. Each line is
divided into words separately; a subsequence of spaces is materialized
as a string object before the program checks that it contains no
letters or numbers and thus is not a word.
It processed 4.8MB in 3.2s on my machine (Athlon 2000, 1.25GHz), which
I think is good enough under these conditions. The input happens to
be ASCII (a mailbox), but the program didn't know beforehand that it
was ASCII.
-- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/
This archive was generated by hypermail 2.1.5 : Wed Dec 08 2004 - 13:28:12 CST