RE: Unicode Searching

From: Addison Phillips (AddisonP@simultrans.com)
Date: Wed Apr 28 1999 - 15:13:47 EDT


Hi Randy,

There are a number of issues to consider before starting.

From a practical perspective, you will need to enable your code to process
Unicode characters. This is relatively straightforward and can be
accomplished in a variety of ways (Win32 API, Standard C/C++, etc.).
Microsoft has extensive documentation in MSDN, the VC++ documentation, and
on their website on exactly how to do this (but note the platform caveats
at the end of this message).

Secondly, you will need to handle some elements that are specific to
Unicode as a character set. In particular you will need to address
character composition: multiple code points can combine to form a single
character, so "n" followed by a combining tilde forms "ñ", which also exists
as a single precomposed character. The data you are searching may be
precomposed (or not) and the search strings from the user may be precomposed
(or not; more likely not), so both sides need to be brought to the same form
before they are compared.
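
A minimal sketch of the composition issue, written in Python for brevity
rather than the Win32/C++ you would actually use. It assumes both the data
and the query are brought to Unicode Normalization Form C before matching;
the helper names normalize_for_search and contains are made up for this
illustration.

```python
import unicodedata

def normalize_for_search(text: str) -> str:
    """Return text in NFC so composed and decomposed spellings compare equal."""
    return unicodedata.normalize("NFC", text)

def contains(haystack: str, needle: str) -> bool:
    """Substring search that is insensitive to composed/decomposed form."""
    return normalize_for_search(needle) in normalize_for_search(haystack)

precomposed = "ma\u00f1ana"   # uses U+00F1 LATIN SMALL LETTER N WITH TILDE
decomposed = "man\u0303ana"   # uses "n" followed by U+0303 COMBINING TILDE

print(precomposed == decomposed)          # raw code points differ
print(contains(precomposed, decomposed))  # match after normalization
```

Normalizing to the decomposed form (NFD) would work just as well, provided
the same form is applied to both the data and the search string.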

Thirdly, most data won't be in Unicode to start with. You'll need to be able
to convert from other character sets to Unicode. Since your product searches
HTML files, you'll need to convert HTML entities too. Some of this is built
into WinNT (or is easily added), but is not immediately available on Win9x
and Win3x.
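
A rough sketch of that conversion step, again in Python rather than Win32
code. It assumes you already know the file's character set (for instance
from the HTTP header or a META tag; detecting it is a separate problem), and
the helper name to_unicode_text is invented for this example.

```python
import html

def to_unicode_text(raw: bytes, charset: str) -> str:
    """Decode bytes from a known charset, then expand HTML entities."""
    # Undecodable bytes become U+FFFD instead of raising an error.
    text = raw.decode(charset, errors="replace")
    # Expands named (&ntilde;) and numeric (&#241;) character references.
    return html.unescape(text)

# Sample data: Windows-1252 bytes containing named and numeric entities.
raw = "Espa&ntilde;a &amp; caf\u00e9 &#241;".encode("cp1252")
print(to_unicode_text(raw, "cp1252"))
```

Note the order: decode the bytes first, then expand the entities, so that
the entity names themselves are read in the correct character set.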

You will also need to decide how to handle casing (uppercase/lowercase),
text parsing, and wildcard matches. The meaning of these (and other text
attributes) changes with language/locale and you will need to decide how to
handle multi-lingual text (if that is really what your application is going
to do) in a straightforward and intelligent manner. Multi-lingual searching
is a non-trivial topic that I haven't the bandwidth to address here, but you
may not really have that as a requirement. However, your desire to support
"double-byte" does open up a whole can of worms, starting with this: there
are no spaces in most Asian language text runs, so your default word parsing
mechanism will have to change.
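
To make two of these points concrete, here is a small Python sketch (an
illustration only, not a recommendation for your implementation). It uses
the language's built-in full case folding, which is not locale-sensitive and
so does not cover cases such as the Turkish dotted and dotless "i", and it
shows what a naive whitespace tokenizer does with Japanese text. The names
fold and naive_words are made up for this example.

```python
def fold(text: str) -> str:
    """Locale-independent case folding for caseless comparison."""
    return text.casefold()

def naive_words(text: str) -> list[str]:
    """Split on whitespace only, as a simple Western-style tokenizer would."""
    return text.split()

# Simple lowercasing leaves the German sharp s alone; full folding maps it
# to "ss", so the two spellings compare equal only after folding.
print("STRASSE".lower() == "stra\u00dfe".lower())
print(fold("STRASSE") == fold("stra\u00dfe"))

# English yields one token per word; the Japanese sentence contains no
# spaces, so the whole run comes back as a single token.
print(len(naive_words("search the text")))
print(len(naive_words("\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8\u3092\u691c\u7d22\u3059\u308b")))
```

Real word segmentation for such text needs a dictionary or a
language-specific analyzer, which is well beyond this sketch.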

You should be aware that 'Windows' covers a lot of ground. Implementing
Unicode support on Win9x is possible, but the Unicode support available
there is significantly different from that under WinNT. Fortunately, there
are libraries that address some of these issues.

Regards,

Addison
        __________________________________________

        Addison Phillips
        Director, Globalization Services
        SimulTrans, L.L.C.
        2606 Bayshore Parkway
        Mountain View, California 94043 USA

        +1 650-526-4652 (direct telephone)
        +1 650-969-9959 (facsimile)
        AddisonP@simultrans.com (Internet email)
        http://www.simultrans.com (website)

        "22 languages. One release date."
        __________________________________________

-----Original Message-----
From: Randy Hughes [mailto:hughesr@unidial.com]
Sent: Wednesday, April 28, 1999 11:14 AM
To: Unicode List
Subject: Unicode Searching

I have written a Searching application for Windows. I am interested in
adding Unicode searching capability to it. Can someone give me a brief
list of issues to consider, or point me to a good starting point for adding
this capability. If you need to see the product it can be downloaded from
my website listed below. It will currently handle only single-byte, and I
am trying to figure out how to get it to Double-Byte.

Thanks
Randy Hughes
Jr Computing
http://www.jrcomputing.com



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:45 EDT