Re: Decomposed vs Composed accented characters

From: Mike Ayers (mayers@celequest.com)
Date: Thu Apr 06 2006 - 15:43:20 CST

    Tay, William wrote:
    > Hi,
    >
    > I have a C/C++ UNIX application that uses standard UTF-8 as the internal
    > text encoding. If it receives a UTF-8 encoded decomposed accented
    > character, i.e. base character + accent, from a MacOS X application, it
    > would need to be able to detect that the character was decomposed, and
    > then compose it prior to further processing. Is there any Solaris/UNIX
    > utility or functions that can help my application do the detection and
    > character composition?

            You should take a look at ICU at http://icu.sourceforge.net. It does
    what you need and a lot of things you may not have thought of yet.
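
    A minimal sketch of the detect-then-compose step, for illustration only. It uses Python's standard unicodedata module rather than ICU, and assumes that "composed" means Unicode Normalization Form C (NFC). The helper names is_composed and compose_utf8 are hypothetical; in a C/C++ application the same two steps would go through ICU's normalization API instead.

```python
import unicodedata


def is_composed(text: str) -> bool:
    """Return True if the text is already in NFC (composed) form."""
    # Hypothetical helper: normalizing to NFC and comparing detects any
    # sequence that composition would change.
    return unicodedata.normalize("NFC", text) == text


def compose_utf8(raw: bytes) -> bytes:
    """Decode UTF-8 bytes, compose to NFC, and re-encode as UTF-8."""
    text = raw.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8")


# 'e' followed by U+0301 COMBINING ACUTE ACCENT, as a decomposed sender
# might transmit it.
decomposed = "e\u0301".encode("utf-8")
composed = compose_utf8(decomposed)
```

    Text that is already composed, or plain ASCII, passes through compose_utf8 unchanged, so it should be safe to apply to every incoming string rather than only to those flagged by is_composed.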

    > Now, the application from which the decomposed accented character
    > originated may query my application so that the character is returned to
    > it. If my application has already composed the character, won't it be a
    > problem for the querying application, since it expects to receive the
    > character in its decomposed format?

            If the applications are treating Unicode strings as binary data, as
    some applications, most notably many file systems, do, then you may want
    to preserve the original value alongside the normalized form. This
    approximately doubles storage requirements. You could, instead,
    normalize as needed if it is computationally affordable.
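
    The two options above could look roughly like this. It is a sketch under assumptions, not a definitive design: TextStore, put, get_original and same_text are hypothetical names, NFC is assumed as the normalized form, and Python's standard unicodedata module stands in for ICU.

```python
import unicodedata


def _nfc(raw: bytes) -> str:
    """Decode UTF-8 bytes and normalize to NFC (assumed normalized form)."""
    return unicodedata.normalize("NFC", raw.decode("utf-8"))


class TextStore:
    """Option 1: keep the sender's original bytes alongside a normalized key."""

    def __init__(self) -> None:
        self._original_by_key = {}

    def put(self, raw: bytes) -> None:
        # Index by the normalized form, but store the bytes exactly as received.
        self._original_by_key[_nfc(raw)] = raw

    def get_original(self, raw_query: bytes):
        # Composed and decomposed queries map to the same key.
        return self._original_by_key.get(_nfc(raw_query))


def same_text(a: bytes, b: bytes) -> bool:
    """Option 2: store nothing extra and normalize at comparison time."""
    return _nfc(a) == _nfc(b)


store = TextStore()
sent = "cafe\u0301".encode("utf-8")        # decomposed, as received
store.put(sent)
returned = store.get_original("caf\u00e9".encode("utf-8"))  # composed query
```

    With the first option the querying application gets back the exact byte sequence it sent, at the cost of holding both forms. The second trades that storage for a normalization pass on every comparison.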

    > Can accented characters be decomposed in other encodings, e.g. ISO
    > 8859-1, as well?
    >
    > Btw, what common applications/operating systems generate decomposed
    > accented characters?

            I don't know, but the preserve+normalize strategy should eliminate
    these concerns.

            HTH,

    /|/|ike


