RE: Unicode Searching

From: Addison Phillips (AddisonP@simultrans.com)
Date: Thu Apr 29 1999 - 12:07:43 EDT


True multi-lingual searching is a HIGHLY non-trivial topic. You can adjust (well, rewrite) a search program to work for a specific language or family of languages, but, as we've both pointed out, "searching text" means something different in each locale.

However, there is a possibility that Randy's problems are simpler and he merely wants to search English-oriented text that might contain some "foreign language" characters. Unicode-enabling his software is the place to start, but it is certainly only the tip of the iceberg...

Addison

-----Original Message-----
From: Jonathan Rosenne [mailto:rosenne@qsm.co.il]
Sent: Thursday, April 29, 1999 12:13 AM
To: Unicode List
Subject: RE: Unicode Searching


I would like to take this opportunity to remind those concerned that
searching in the international environment requires more than the correct
processing of characters. Most search engines are not only Western-oriented,
but even more narrowly oriented to English.

For Hebrew and Arabic, the usefulness of common search tools is limited
because they do not handle the rich set of prefixes and suffixes we use and
other types of inflection that are rare in European languages. For
example, when you search for a word, you will normally not find instances
that are plural (similar to man - men in English).

Jony

> -----Original Message-----
> From: Addison Phillips [mailto:AddisonP@simultrans.com]
> Sent: Wednesday, April 28, 1999 9:24 PM
> To: Unicode List
> Subject: RE: Unicode Searching
>
>
> Hi Randy,
>
> There are a number of issues to consider before starting.
>
> From a practical perspective, you will need to enable your code to process
> Unicode characters. This is a relatively straightforward process and can be
> accomplished in a variety of ways (Win32 API, Standard C/C++, etc.).
> Microsoft has extensive documentation in MSDN, the VC++ documentation, and
> on their website on how, exactly, to do this (but note the platform
> caveats).
>
> Secondly, you will need to handle some elements of Unicode specific to the
> character set. In particular you will need to address character composition
> (e.g. multiple code points that combine to form a character--so you can have
> "n" and "~" combine to form "ñ" or just use the precomposed character). The
> data you are searching may be precomposed (or not) and the search strings
> from the user may be precomposed (or not--more likely not).
>
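
The composition issue above can be sketched in a few lines. This is only
an illustration, assuming Python and its standard unicodedata module; the
helper names normalize_for_search and contains are made up for the example,
and a real search tool would need to decide which normalization form suits
its data. Both the text and the query are brought to the same form (NFC
here) before they are compared.

```python
import unicodedata


def normalize_for_search(text: str) -> str:
    """Return text in Normalization Form C (precomposed where possible).

    Hypothetical helper: NFC is one reasonable choice, not the only one.
    """
    return unicodedata.normalize("NFC", text)


def contains(haystack: str, needle: str) -> bool:
    """Substring test that ignores composed/decomposed differences."""
    return normalize_for_search(needle) in normalize_for_search(haystack)


# "n" followed by U+0303 COMBINING TILDE versus the precomposed U+00F1.
decomposed = "man\u0303ana"
precomposed = "ma\u00f1ana"
```

Compared code point by code point, the two sample strings differ, yet
contains(precomposed, "n\u0303") reports a match because both sides are
normalized first.
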
> Thirdly, most data won't be in Unicode to start. You'll need to be able to
> convert character sets to Unicode. Since your product searches HTML files,
> you'll need to convert HTML entities too. Some of this is built into WinNT
> (or is easily added), but not immediately available on Win9x and Win3x.
>
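
As a rough sketch of the conversion step, assuming Python and its standard
html module: the bytes are decoded from a legacy code page into Unicode and
the HTML character references are then expanded. The function name
decode_html_text and the cp1252 default are assumptions made for the
example; in practice the encoding has to come from the HTTP headers, a META
tag, or the user, and markup still has to be stripped separately.

```python
import html


def decode_html_text(raw: bytes, encoding: str = "cp1252") -> str:
    """Decode legacy-encoded bytes and expand HTML character references.

    Hypothetical helper. The encoding default is an assumption; bytes that
    are invalid in that encoding become U+FFFD instead of raising an error.
    """
    text = raw.decode(encoding, errors="replace")
    # html.unescape handles named (&ntilde;) and numeric (&#241;) references.
    return html.unescape(text)


# Three spellings of the same letter: named entity, numeric reference,
# and the raw cp1252 byte 0xF1.
sample = b"Espa&ntilde;a &#241; \xf1"
```

After decoding, all three spellings in the sample become the single code
point U+00F1, so one search string can match any of them.
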
> You will also need to decide how to handle casing (uppercase/lowercase),
> text parsing, and wildcard matches. The meaning of these (and other text
> attributes) changes with language/locale and you will need to decide how to
> handle multi-lingual text (if that is really what your application is going
> to do) in a straightforward and intelligent manner. Multi-lingual searching
> is a non-trivial topic that I haven't the bandwidth to address here, but you
> may not really have that as a requirement. However, your desire to support
> "double-byte" does open up a whole can of worms starting with: there are no
> spaces in most Asian language text runs, so your default word parsing
> mechanism will change.
>
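
To make the casing and word-parsing points concrete, here is a deliberately
crude sketch, assuming Python. The names fold_case, is_cjk and tokenize are
invented for the example. Case folding uses str.casefold, which is
locale-independent and so does not cover language-specific rules such as
the Turkish dotted and dotless i. The tokenizer falls back to emitting each
kana or CJK ideograph as its own token because there are no spaces to split
on; the code point ranges are a simplifying assumption, and real
segmentation usually needs a dictionary or a dedicated library.

```python
from typing import List


def fold_case(text: str) -> str:
    """Locale-independent case folding for caseless comparison."""
    return text.casefold()


def is_cjk(ch: str) -> bool:
    """Rough test (assumption): Hiragana, Katakana, CJK Unified Ideographs."""
    cp = ord(ch)
    return 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF


def tokenize(text: str) -> List[str]:
    """Split on non-alphanumerics, but emit each CJK character on its own."""
    tokens: List[str] = []
    word: List[str] = []

    def flush() -> None:
        # Close the word being accumulated, if there is one.
        if word:
            tokens.append("".join(word))
            word.clear()

    for ch in text:
        if is_cjk(ch):          # checked first: these are also alphanumeric
            flush()
            tokens.append(ch)
        elif ch.isalnum():
            word.append(ch)
        else:                   # whitespace or punctuation ends a word
            flush()
    flush()
    return tokens
```

With this sketch, fold_case("STRASSE") and fold_case("stra\u00dfe") compare
equal, which a plain lowercase conversion would not achieve, and a run of
ideographs is indexed character by character rather than as one giant
"word".
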
> You should be aware that 'Windows' covers a lot of ground and that
> implementing Unicode support on Win9x is possible, but that Unicode support
> is significantly different than under WinNT. Fortunately there are libraries
> that address some of these issues.
>
> Regards,
>
> Addison
> __________________________________________
>
> Addison Phillips
> Director, Globalization Services
> SimulTrans, L.L.C.
> 2606 Bayshore Parkway
> Mountain View, California 94043 USA
>
> +1 650-526-4652 (direct telephone)
> +1 650-969-9959 (facsimile)
> AddisonP@simultrans.com (Internet email)
> http://www.simultrans.com (website)
>
> "22 languages. One release date."
> __________________________________________
>
> -----Original Message-----
> From: Randy Hughes [mailto:hughesr@unidial.com]
> Sent: Wednesday, April 28, 1999 11:14 AM
> To: Unicode List
> Subject: Unicode Searching
>
>
> I have written a searching application for Windows. I am interested in
> adding Unicode searching capability to it. Can someone give me a brief
> list of issues to consider, or point me to a good starting point for adding
> this capability? If you need to see the product, it can be downloaded from
> my website listed below. It currently handles only single-byte text, and I
> am trying to figure out how to get it to double-byte.
>
> Thanks
> Randy Hughes
> Jr Computing
> http://www.jrcomputing.com
>
