Re: Decomposable characters with marks (or other combining characters)...

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Aug 06 1999 - 17:17:21 EDT


Mr. Susnjar,

>
> Decomposing characters is easy.

Well, actually it is a little more complex than you seem to think, or
some of your questions would already have answers for you. Make
sure you fully understand section 3.9 of the Unicode Standard, and
the nature of recursive application of the decomposition mappings
in the Unicode Character Database data file UnicodeData.txt.

> Lowercasing them too (if they have a lowercase version).
> If they are upper- or title- case, they can stay like that, as well
> and will not make any problems to our search engine.

If they are upper- or title- case, and you are lowercasing for your
normalization, they should end up lowercase.
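
A small Python sketch of that point, assuming the built-in str.lower()
stands in for whatever lowercasing step your search engine uses (that
choice is an assumption made here purely for illustration):

```python
# The uppercase digraph U+01F1 (DZ) and the titlecase digraph U+01F2 (Dz)
# both have U+01F3 (dz) as their lowercase mapping.
upper_dz = "\u01F1"
title_dz = "\u01F2"

lowered = [upper_dz.lower(), title_dz.lower()]
print([f"{ord(c):04X}" for c in lowered])
```

Both inputs come out as the same lowercase character, which is what a
case-insensitive search needs.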

> What is ambiguous to us is what should we do in the following cases:
>
> 1. Decomposable character (that decomposes to few letter-characters)
> is followed by mark(s) (combining character(s)). Where do marks apply?
> What if the same mark is specified multiple times?

This is not at all ambiguous if you follow the specification for canonical
decomposition in the standard.

>
> Example:
>
> [01F1] [030C]
> should be decomposed (during regularization) to:
>
> a) [0044] [005A] [030C]
> b) [0044] [030C] [005A]
> c) [0044] [030C] [005A] [030C]
> d) something else... what?

Canonical decomposition is: 01F1 030C
Compatibility decomposition is: 0044 005A 030C
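
The two results can be checked with a short Python sketch. It assumes
the standard library's unicodedata module, which carries a later
version of the Unicode Character Database than the one discussed here;
the helper name hexes() is made up for this example.

```python
import unicodedata

def hexes(s):
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)

text = "\u01F1\u030C"  # DZ digraph followed by COMBINING CARON

# U+01F1 has only a compatibility mapping, so canonical decomposition
# (NFD) leaves it alone, while compatibility decomposition (NFKD)
# expands it.
nfd = unicodedata.normalize("NFD", text)
nfkd = unicodedata.normalize("NFKD", text)

print(hexes(nfd))   # 01F1 030C
print(hexes(nfkd))  # 0044 005A 030C
```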

>
> Translated:
> [DZ] [Combining Caron]
>
> a) [D] [Z] [Combining Caron]
> b) [D] [Combining Caron] [Z]
> c) [D] [Combining Caron] [Z] [Combining Caron]
> d) something else... what?
>
> -------------------------------------------------------------------
>
> 2. Decomposable character (that decomposes to a character and a mark)
> is followed by another (or even same?) mark(s) (combining character(s)).
> What if the marks are the same? Should we 'collapse' the two same marks
> into one?
>
> Example:
>
> [01C4] [030C]
>
> where:
>
> [01C4] -> [0044] [017D]
> &
> [017D] -> [005A] [030C]
>
> should be decomposed (during regularization) to:
>
> a) [0044] [005A] [030C] [030C]
> b) [0044] [005A] [030C]
> c) [0044] [030C] [005A] [030C]
> d) [0044] [005A]
> e) something else... what?

Canonical decomposition is: 01C4 030C
Compatibility decomposition is: 0044 005A 030C 030C
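
To see where the two carons come from, here is a minimal Python sketch
that looks at the individual mappings and then at the normalized
result. It assumes the standard library's unicodedata module, whose
decomposition() function returns the raw mapping field from the
Unicode Character Database as a string.

```python
import unicodedata

# Raw mapping fields: U+01C4 maps (compatibility) to D + Z-with-caron,
# and U+017D maps (canonically) to Z + combining caron.
step1 = unicodedata.decomposition("\u01C4")
step2 = unicodedata.decomposition("\u017D")
print(step1)  # <compat> 0044 017D
print(step2)  # 005A 030C

text = "\u01C4\u030C"
nfd = unicodedata.normalize("NFD", text)
nfkd = unicodedata.normalize("NFKD", text)

# The trailing caron from the input is kept next to the one produced
# by the mapping; nothing is collapsed.
print(" ".join(f"{ord(c):04X}" for c in nfkd))  # 0044 005A 030C 030C
```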

>
> Translation:
>
> [DZ with Caron] [Combining Caron]
>
> where:
>
> [DZ with Caron] -> [D] [Z with Caron]
> &
> [Z with Caron] -> [Z] [Combining Caron]
>
> should be decomposed (during regularization) to:
>
> a) [D] [Z] [Combining Caron] [Combining Caron]
> b) [D] [Z] [Combining Caron]
> c) [D] [Combining Caron] [Z] [Combining Caron]
> d) [D] [Z]
> e) something else... what?
>
> -------------------------------------------------------------------
>
> 2. Decomposable character (that decomposes to multiple other
> characters of the same kind, e.g. 0xFDFA - "ARABIC LIGATURE SALLALLAHOU
> ALAYHE WASALLAM" - it decomposes to three words!) is followed by another
> combining/mark character(s).

Follow the same rules as above. The application of combining marks is
cumulative, and the decompositions are the result of simple recursive applications
of the decomposition mappings.
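
The recursive application can be sketched in Python as follows. This
is only an illustration under stated assumptions: it reads the
mappings through the standard library's unicodedata.decomposition(),
the names expand() and decompose() are invented for the example, and
it leaves out the canonical reordering of combining marks and the
algorithmic decomposition of Hangul syllables that a complete
implementation of section 3.9 would need.

```python
import unicodedata

def expand(ch, compat):
    """Recursively apply the decomposition mapping of one character."""
    entry = unicodedata.decomposition(ch)
    if not entry:
        return [ch]                      # no mapping: character stays
    fields = entry.split()
    if fields[0].startswith("<"):        # tagged, e.g. <compat>
        if not compat:
            return [ch]                  # canonical pass ignores it
        fields = fields[1:]              # drop the tag
    out = []
    for field in fields:
        out.extend(expand(chr(int(field, 16)), compat))
    return out

def decompose(text, compat=False):
    """Decompose every character; following marks simply stay in place."""
    out = []
    for ch in text:
        out.extend(expand(ch, compat))
    return "".join(out)

dz = decompose("\u01C4\u030C", compat=True)
print(" ".join(f"{ord(c):04X}" for c in dz))  # 0044 005A 030C 030C

# U+FDFA followed by a mark: the ligature expands to its full sequence
# and the mark ends up after the last character of that sequence.
ligature = decompose("\uFDFA\u0651", compat=True)
print(len(ligature), f"{ord(ligature[-1]):04X}")
```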

>
> -------------------------------------------------------------------
>
> 3. Surrogate characters - are they (some of them) decomposable? Can they
> be regularized? Where can we find tables?

There are no characters currently encoded in Unicode 2.1 or in Unicode 3.0
via surrogates. When such characters are encoded in Unicode 4.0 and later,
their decompositions will be included in the Unicode Character Database,
along with all the other assigned characters.

>
> -------------------------------------------------------------------
>
> 4. Are there such regularization/equivalence tables for CJKV Ideographs?
> If yes, where? Are there equivalent CJKV Ideographs - some books say that
> there are! They mention that there are even 20 different characters for the
> same meaning and pronunciation! Is this true?

The correct behavior for the first level of normalization, as you are
talking about, is to simply make use of the equivalences in the Unicode
Character Database to deal with the 302 compatibility CJK characters in
the standard.
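
As a hedged illustration of that first level, the following Python
sketch maps one compatibility ideograph to its unified counterpart. It
assumes the standard library's unicodedata module, and U+F900 is
picked only as a sample from the compatibility block.

```python
import unicodedata

compat_ideograph = "\uF900"  # first CJK compatibility ideograph

# The database gives this character a one-to-one mapping to a unified
# ideograph, so normalization replaces it.
mapping = unicodedata.decomposition(compat_ideograph)
unified = unicodedata.normalize("NFD", compat_ideograph)

print(mapping)                # 8C48
print(f"{ord(unified):04X}")  # 8C48
```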

Beyond that, you will need to acquire additional tables if you want to
start making equivalences between simplified and traditional forms, for
example. Much of that data can be extracted from the Unihan.txt data
table available on the website.

And rather than just being uncertain what the situation is, you need to
make sure you acquire some in-house or consulting expertise on CJKV
ideographs. I suggest you start by making sure you have a copy of Ken
Lunde's book, CJKV Information Processing.

>
> -------------------------------------------------------------------
>
> 5. Even though Unicode 2.1 and Unicode 3.0 do not use surrogates, will they
> allocate the space used for surrogates for something else or is this space
> reserved for this purpose?

Surrogate space is permanently reserved. It will never be used for
anything other than surrogate code points, which in the future will
encode characters in the range U-00010000 .. U-0010FFFF.
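
The arithmetic that ties a surrogate pair to that range can be
sketched in Python. The function names below are invented for the
example; the constants are the high-surrogate range D800..DBFF and the
low-surrogate range DC00..DFFF.

```python
def from_surrogates(high, low):
    """Combine a high and a low surrogate into one scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def to_surrogates(cp):
    """Split a scalar value above U+FFFF into its surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the surrogate-encoded range")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

print(hex(from_surrogates(0xD800, 0xDC00)))  # 0x10000
print(hex(from_surrogates(0xDBFF, 0xDFFF)))  # 0x10ffff
```

The lowest and highest possible pairs land exactly on the two ends of
the range.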

>
> -------------------------------------------------------------------
>
> 6. Should surrogate characters be encoded in UTF8 (for web use in
> Netscape 4+ and IE4+) using two UTF8 sequences (one for each part/half
> of the surrogate character, this results to six bytes) or using an
> extended 4-byte or 5-byte UTF8 encoding of some UCS4 space? How
> are surrogates mapped in UCS4? Is UCS4 == Unicode for non-surrogate characters?

All of this is explained in detail in the Standard. See pp. A-7 ff of
the Unicode Standard, Version 2.0. The answer is that you *must* use
the 4-byte form of UTF-8 to represent a surrogate pair.
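
A minimal Python sketch of that rule, with a made-up function name:
the pair is first combined into a single scalar value, and that value
is then written as one 4-byte sequence rather than as two 3-byte
sequences.

```python
def utf8_from_surrogate_pair(high, low):
    """Encode a surrogate pair as a single 4-byte UTF-8 sequence."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return bytes([
        0xF0 | (cp >> 18),           # lead byte: 11110xxx
        0x80 | ((cp >> 12) & 0x3F),  # trail bytes: 10xxxxxx
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

encoded = utf8_from_surrogate_pair(0xD800, 0xDC00)
print(encoded.hex(" "))  # f0 90 80 80
```

The result agrees with what Python's own encoder produces for the
same scalar value, chr(0x10000).encode("utf-8").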

--Ken Whistler


