From: Philippe Verdy (email@example.com)
Date: Fri Dec 03 2004 - 17:19:38 CST
From: "Theo" <firstname.lastname@example.org>
>> From: Asmus Freytag <email@example.com>
> So, despite it being UTF-8 case insensitive, it was totally blastingly
> fast. (One person reported counting words at 1MB/second of pure text, from
> within a mixed Basic / C environment). You'll need to keep in mind that
> the counter must look up through thousands of words (every single word it
> has come across in the text), on every single word lookup.
> Anyhow, from my experience, UTF-8 is great for speed and RAM.
Probably true for English or most Western European Latin-based languages
(plus Greek and Coptic).
But for other languages that still use lots of characters in the range
U+0000 to U+03FF (C0 and C1 controls, Basic Latin, Latin-1 Supplement, Latin
Extended-A and -B, IPA Extensions, Spacing Modifier Letters, Combining
Diacritical Marks, Greek and Coptic) UTF-8 and UTF-16 may be nearly as
efficient as each other.
For all others that need lots of characters outside the range U+0000 to
U+03FF (Cyrillic, Armenian, Hebrew, Arabic, and all Asian or Native-American
or African scripts, or even PUAs), UTF-16 is better (more compact in memory).
UTF-32 will be better only for historic texts written nearly completely with
characters outside the BMP (for now, only Old Italic, Gothic, Ugaritic,
Deseret, Shavian, Osmanya, Cypriot Syllabary), and only if C0 controls (such
as TAB, CR and LF), ASCII SPACE, and NBSP are a minority.
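
For anyone who wants to check the size trade-offs on their own text, here
is a minimal Python sketch that counts the bytes each encoding form needs.
The sample strings and the helper name encoded_sizes are only illustrative
assumptions, and the BOM-less "-le" codecs are used so that a byte order
mark does not skew the counts. Keep in mind that two-byte UTF-8 sequences
run up to U+07FF, so the exact break-even point depends on how much ASCII
(spaces, digits, punctuation) is mixed into the text.

```python
# Hypothetical sample strings, one per script family discussed above.
SAMPLES = {
    "Basic Latin": "abc",
    "Greek": "\u03b1\u03b2\u03b3",             # alpha, beta, gamma
    "Cyrillic": "\u0416\u0437",                # Zhe, ze
    "Devanagari": "\u0905\u0906",              # a, aa
    "Gothic (non-BMP)": "\U00010330\U00010331",
}


def encoded_sizes(text):
    """Return the byte lengths of text as (UTF-8, UTF-16, UTF-32)."""
    codecs = ("utf-8", "utf-16-le", "utf-32-le")  # no BOM is emitted
    return tuple(len(text.encode(codec)) for codec in codecs)


if __name__ == "__main__":
    for name, text in SAMPLES.items():
        utf8, utf16, utf32 = encoded_sizes(text)
        print(f"{name:18} UTF-8={utf8:3} UTF-16={utf16:3} UTF-32={utf32:3}")
```

Running it on a real corpus rather than on three-character samples gives a
much better picture, since the share of spaces and line breaks matters a
lot for UTF-8.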