Re: Unicode for words?

From: D. Starner
Date: Sun Dec 05 2004 - 18:17:45 CST

    "Philippe Verdy" writes:

    > > Drop the part of the sentence before "then". A protocol could delete "the", "an", etc. right
    > > now. In fact, I suspect several library systems do drop "the", etc. right now. Not that this
    > > makes it a good idea, but that's a lousy argument.
    > If such a library does this, only based on the presence of the encoded words, without wondering
    > in which language the text is written, that kind of processing text will be seriously
    > inefficient or inaccurate when processing other languages than English for which you will have
    > built such a library.

    Many libraries have large numbers of books in English, French, German, Spanish, Italian,
    and various non-Latin languages. Blanket stripping of "a", "an", "the", and "la" from the
    start of a title might very well be a good 90% heuristic for removing non-sorting
    words from the start of titles. (German is the odd man out, since you can't blanket
    remove a starting "die".)
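
A minimal sketch of that blanket-stripping heuristic, in Python. The word list is just the four words named above, and the names NON_SORTING_WORDS and sort_key are made up for illustration; this is not how any particular library system does it. Note that "die" is deliberately left out of the list, so "Die Hard" keeps its first word.

```python
# Hypothetical word list: only the four words named in the message.
# "die" is deliberately absent, since it cannot be blanket-removed.
NON_SORTING_WORDS = ("a", "an", "the", "la")


def sort_key(title: str) -> str:
    """Return a sort key with at most one leading non-sorting word removed."""
    words = title.split()
    # Strip the leading word only if something remains after it,
    # so a title that is just "A" does not become empty.
    if len(words) > 1 and words[0].casefold() in NON_SORTING_WORDS:
        words = words[1:]
    return " ".join(words).casefold()


titles = [
    "The Hobbit",
    "La Peste",
    "A Tale of Two Cities",
    "Die Hard",
    "An Essay on Man",
]
sorted_titles = sorted(titles, key=sort_key)
```

The heuristic is language-blind on purpose: it never asks what language a title is in, which is exactly why it is only a rough rule and not a correct one.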

    > For plain-text (which is what Unicode deals about), even the "an", "the", "is" words (and so
    > on...) are equally important as other parts of the text.

    No. It all depends on what you want to do with the text.

    Besides which, the point is that it doesn't matter whether or not words are encoded as
    codepoints; these processes can work just the same.

    This archive was generated by hypermail 2.1.5 : Sun Dec 05 2004 - 18:19:20 CST