Re: Decomposed vs Composed accented characters

From: Mike Ayers (mayers@celequest.com)
Date: Thu Apr 06 2006 - 15:43:20 CST

Next message: Peter Edberg: "Re: CLDR: Bad exemplar chars for some locales"

Previous message: Jukka K. Korpela: "Re: CLDR: Bad exemplar chars for some locales [ar,fa]"
In reply to: Tay, William: "Decomposed vs Composed accented characters"
Next in thread: Otto Stolz: "Re: Decomposed vs Composed accented characters"

Tay, William wrote:
> Hi,
>
> I have a C/C++ UNIX application that uses standard UTF-8 as the internal
> text encoding. If it receives a UTF-8 encoded decomposed accented
> character, i.e. base character + accent, from a MacOS X application, it
> would need to be able to detect that the character was decomposed, and
> then compose it prior to further processing. Is there any Solaris/UNIX
> utility or functions that can help my application do the detection and
> character composition?

You should take a look at ICU at http://icu.sourceforge.net. It does
what you need and a lot of things you may not have thought of yet.
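
As a rough illustration of the detect-then-compose step, here is a
minimal sketch. It is an assumption-laden stand-in, not ICU: it is written
in Python and uses the standard unicodedata module only to stay
self-contained, and the function name compose_utf8 is hypothetical. It
treats "composed" as Unicode Normalization Form C (NFC), which a C/C++
application would get from ICU's normalization API.

```python
import unicodedata

def compose_utf8(raw: bytes):
    """Return (nfc_bytes, was_changed) for UTF-8 input.

    Hypothetical helper: 'composed' here means Unicode NFC.
    """
    text = raw.decode("utf-8")
    composed = unicodedata.normalize("NFC", text)
    # Detection: the input was not in NFC if normalizing changed it.
    return composed.encode("utf-8"), composed != text

# 'e' followed by U+0301 COMBINING ACUTE ACCENT, as UTF-8
decomposed = b"e\xcc\x81"
composed_bytes, changed = compose_utf8(decomposed)
```

Comparing the normalized result with the input doubles as the detection
step, so one pass covers both needs.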

> Now, the application from which the decomposed accented character
> originated may query my application so that the character is returned to
> it. If my application has already composed the character, won't it be a
> problem for the querying application, since it expects to receive the
> character in its decomposed format?

If the applications treat Unicode strings as binary data, as some
applications (most notably many file systems) do, then you may want
to preserve the original value alongside the normalized form. This
approximately doubles storage requirements. Alternatively, you could
keep only the original and normalize on demand, if that is
computationally affordable.
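
The preserve-alongside-normalized idea can be sketched as follows. This
is only an illustration under assumptions: the class name
PreservingStore and the helper nfc are hypothetical, the normalized form
is taken to be NFC, and Python's standard unicodedata module stands in
for whatever normalization library the real application uses.

```python
import unicodedata

def nfc(raw: bytes) -> bytes:
    """Normalize UTF-8 bytes to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", raw.decode("utf-8")).encode("utf-8")

class PreservingStore:
    """Index values by their NFC form, but keep the bytes as received."""

    def __init__(self):
        self._by_key = {}  # NFC bytes -> original bytes

    def put(self, raw: bytes) -> None:
        self._by_key[nfc(raw)] = raw

    def get(self, query: bytes):
        # The query is normalized too, so either form finds the entry;
        # the caller gets back exactly what was originally stored.
        return self._by_key.get(nfc(query))

store = PreservingStore()
store.put(b"Cafe\xcc\x81")               # decomposed form, as sent
original = store.get(b"Caf\xc3\xa9")     # looked up by the composed form
```

Here the sender's decomposed bytes come back unchanged even though
lookups and comparisons run on the composed form. The cost is storing
both the key and the original.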

> Can accented characters be decomposed in other encodings, e.g. ISO
> 8859-1, as well?
>
> Btw, what common applications/operating systems generate decomposed
> accented characters?

I don't know, but the preserve+normalize strategy should eliminate
these concerns.

HTH,

/|/|ike

