Invalid UTF-8 sequences (was RE: Unicode Search Engines)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Tue Jan 29 2002 - 08:47:43 EST


Hmmmm, this reminds me... I have noticed that search engines currently return
their results in one particular encoding. As a result, the text excerpt (or
summary) cannot be shown for some pages. OK, this will get much better once
search results are delivered in UTF-8.

I have one concern though. Converting an HTML file from its own codeset to
that of the results HTML file (and, of course, of the engine's database) is
possible only when the codeset information is available. But there are also
plain text files (.txt) on the web. Suppose I get a hit for such a page. The
hit could be a word in the file that is pure ASCII, but the neighboring text
could be in an unknown codeset.
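
To illustrate the situation, here is a minimal Python sketch. The function
name classify_excerpt and the sample bytes are made up for this example; it
only distinguishes pure ASCII, valid UTF-8, and "something else", and does
not try to guess the actual codeset.

```python
def classify_excerpt(raw: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'unknown' for a chunk of a text file.

    Hypothetical helper: it checks validity only, it does not detect
    which legacy codeset the bytes were written in.
    """
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"


# A .txt page stored in Latin-1, with nothing on the web server saying so.
page = b"r\xe9sum\xe9 of the Unicode standard"

# The search hit itself is plain ASCII...
start = page.index(b"Unicode")
hit = page[start:start + len(b"Unicode")]
print(classify_excerpt(hit))

# ...but the text around it is not valid UTF-8.
print(classify_excerpt(page))
```

The hit alone would pass any check, so the problem only shows up once the
surrounding text is pulled into the excerpt.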

Simply including a portion of that file in an HTML file marked as UTF-8
can obviously result in invalid UTF-8 sequences. That is somewhat bad in
itself, and it gets even worse if for some reason that file is later
converted from UTF-8 to UTF-16, because the converter then has to reject,
drop, or replace the invalid bytes.
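
A small Python sketch of the problem, under the assumption that the engine
chooses to substitute U+FFFD for undecodable bytes (one possible policy, not
necessarily what any real engine does). The name safe_excerpt and the sample
bytes are invented for the example.

```python
def safe_excerpt(raw: bytes) -> str:
    """Decode an excerpt as UTF-8, replacing invalid bytes with U+FFFD.

    Hypothetical policy for illustration; the original bytes are lost.
    """
    return raw.decode("utf-8", errors="replace")


# Latin-1 bytes copied verbatim into a page that claims to be UTF-8.
raw = b"caf\xe9 Unicode"

# A strict decoder refuses the excerpt outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print("valid UTF-8:", strict_ok)

# With replacement the excerpt becomes well formed, and only then can it be
# converted to UTF-16 without further trouble.
text = safe_excerpt(raw)
utf16 = text.encode("utf-16-le")
print(repr(text))
```

Once the replacement has happened, the UTF-16 form round-trips cleanly, but
the character that was originally in the file cannot be recovered from it.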

Regards,

Lars Kristan

> -----Original Message-----
> From: Stefan Probst [mailto:stefan.probst@opticom.v-nam.net]
> Sent: Monday, January 28, 2002 16:18
> To: unicode@unicode.org
> Subject: Re: Unicode Search Engines
>
>
> On Wed Jan 16 23:49:29 2002 +0400 Aman Chawla wrote:
> >Are there any search engines at all at present which allow one to search
> >sites encoded in UTF-8? If not, are there plans to build such search
> >engines? For example, is Google going to implement such an engine?
>
> I would like to add:
> How do they handle normalization?
> In Vietnam, many characters can be represented in several different ways:
> (1) fully precomposed (NFC)
> (2) base character and modifier precomposed, tonal mark combining
> (3) base character, then modifier, then tonal mark
> (4) like (3), but modifier and tonal mark sorted (NFD)
> Do the search engines do any normalization, before indexing a page?
> Are queries normalized before running the search?
>
> In other words:
> For example, if the page is written in NFC, but the query is entered in
> NFD, will it find anything?
>
> Rgds,
> Stefan
>
>
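
The four representations listed in the quoted message can be tried out with
Python's standard unicodedata module. This is only a sketch: the choice of
U+1EC7 (e with circumflex and dot below) as the sample letter and the helper
name search_key are assumptions made for the example, and it says nothing
about what any actual search engine does.

```python
import unicodedata

# Four spellings of the same Vietnamese letter, matching (1) to (4) above.
forms = {
    "(1) fully precomposed": "\u1ec7",
    "(2) base+modifier precomposed, tone combining": "\u00ea\u0323",
    "(3) base, modifier, tonal mark": "e\u0302\u0323",
    "(4) base, marks in canonical order": "e\u0323\u0302",
}


def search_key(s: str) -> str:
    """Hypothetical index/query key: normalize everything to NFC."""
    return unicodedata.normalize("NFC", s)


# Without normalization the four strings all compare unequal.
print(len(set(forms.values())))

# After normalization they collapse to a single key.
keys = {search_key(s) for s in forms.values()}
print(len(keys))

# An NFD query against an NFC page matches only if both sides are normalized.
page_text = unicodedata.normalize("NFC", forms["(1) fully precomposed"])
query = unicodedata.normalize("NFD", forms["(1) fully precomposed"])
print(query in page_text, search_key(query) in search_key(page_text))
```

Normalizing both the indexed text and the query to the same form (NFC here,
though NFD would work equally well) is what makes the comparison succeed.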


