Re: Decomposable characters with marks (or other combining characters)...

From: Mark Davis (mark@macchiato.com)
Date: Sun Aug 08 1999 - 03:01:42 EDT


I agree completely with Ken. You should also look at Unicode Technical Report #15,
Unicode Normalization Forms (http://www.unicode.org/unicode/reports/tr15/), especially
the section on Programming Language Identifiers for mention of case mappings.

Mark

Kenneth Whistler wrote:

> Mr. Susnjar,
>
> >
> > Decomposing characters is easy.
>
> Well, actually it is a little more complex than you seem to think, or
> some of your questions would already have answers for you. Make
> sure you fully understand section 3.9 of the Unicode Standard, and
> the nature of recursive application of the decomposition mappings
> in the Unicode Character Database data file UnicodeData.txt.
>
> > Lowercasing them too (if they have a lowercase version).
> > If they are upper- or title- case, they can stay like that as well,
> > and will not cause any problems for our search engine.
>
> If they are upper- or title- case, and you are lowercasing for your
> normalization, they should end up lowercase.
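
A minimal sketch of that point, assuming Python's standard unicodedata
module (not mentioned in this thread). The helper name and the choice to
lowercase before decomposing are illustrative assumptions, not a rule taken
from the standard:

```python
import unicodedata

def lowercase_then_decompose(text: str) -> str:
    # Hypothetical helper: lowercase first, then apply compatibility
    # decomposition (NFKD) so digraph characters become plain letters.
    return unicodedata.normalize("NFKD", text.lower())

# U+01F1 (DZ, uppercase), U+01F2 (Dz, titlecase) and U+01F3 (dz, lowercase)
# all end up as the same two lowercase letters.
for ch in ("\u01F1", "\u01F2", "\u01F3"):
    print(f"U+{ord(ch):04X} -> {lowercase_then_decompose(ch)!r}")
```
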
>
> > What is ambiguous to us is what we should do in the following cases:
> >
> > 1. Decomposable character (that decomposes to a few letter characters)
> > is followed by mark(s) (combining character(s)). Where do marks apply?
> > What if the same mark is specified multiple times?
>
> This is not at all ambiguous if you follow the specification for canonical
> decomposition in the standard.
>
> >
> > Example:
> >
> > [01F1] [030C]
> > should be decomposed (during regularization) to:
> >
> > a) [0044] [005A] [030C]
> > b) [0044] [030C] [005A]
> > c) [0044] [030C] [005A] [030C]
> > d) something else... what?
>
> Canonical decomposition is: 01F1 030C
> Compatibility decomposition is: 0044 005A 030C
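
The two results above can be reproduced with a short sketch, assuming
Python's standard unicodedata module (an assumption of this example, not a
tool named in the thread). NFD applies canonical decomposition and NFKD
applies compatibility decomposition; the hexes() helper is illustrative:

```python
import unicodedata

def hexes(text: str) -> str:
    # Illustrative helper: show a string as space-separated code points.
    return " ".join(f"{ord(c):04X}" for c in text)

source = "\u01F1\u030C"  # DZ digraph followed by COMBINING CARON

canonical = unicodedata.normalize("NFD", source)
compatibility = unicodedata.normalize("NFKD", source)

print(hexes(canonical))      # 01F1 030C
print(hexes(compatibility))  # 0044 005A 030C
```

In the compatibility form the caron simply follows the Z, which corresponds
to option (a) in the question.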
>
> >
> > Translated:
> > [DZ] [Combining Caron]
> >
> > a) [D] [Z] [Combining Caron]
> > b) [D] [Combining Caron] [Z]
> > c) [D] [Combining Caron] [Z] [Combining Caron]
> > d) something else... what?
> >
> > -------------------------------------------------------------------
> >
> > 2. Decomposable character (that decomposes to a character and a mark)
> > is followed by another (or even same?) mark(s) (combining character(s)).
> > What if the marks are the same? Should we 'collapse' the two same marks
> > into one?
> >
> > Example:
> >
> > [01C4] [030C]
> >
> > where:
> >
> > [01C4] -> [0044] [017D]
> > &
> > [017D] -> [005A] [030C]
> >
> > should be decomposed (during regularization) to:
> >
> > a) [0044] [005A] [030C] [030C]
> > b) [0044] [005A] [030C]
> > c) [0044] [030C] [005A] [030C]
> > d) [0044] [005A]
> > e) something else... what?
>
> Canonical decomposition is: 01C4 030C
> Compatibility decomposition is: 0044 005A 030C 030C
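
The recursive expansion behind that answer can be sketched as follows,
assuming Python's standard unicodedata module, whose decomposition()
function returns the mapping field from UnicodeData.txt. The function names
are illustrative, and the sketch deliberately omits canonical reordering of
marks and Hangul handling, so it is not a full normalizer:

```python
import unicodedata

def decompose_char(ch: str, compat: bool) -> str:
    # The mapping looks like "005A 030C" (canonical),
    # "<compat> 0044 017D" (compatibility), or "" (none).
    fields = unicodedata.decomposition(ch).split()
    if not fields:
        return ch
    if fields[0].startswith("<"):
        if not compat:
            return ch          # tagged mappings are compatibility-only
        fields = fields[1:]    # drop the "<tag>"
    # Recurse: a mapping may itself contain decomposable characters.
    return "".join(decompose_char(chr(int(f, 16)), compat) for f in fields)

def decompose_text(text: str, compat: bool) -> str:
    return "".join(decompose_char(c, compat) for c in text)

source = "\u01C4\u030C"  # DZ with caron, followed by COMBINING CARON
for compat in (False, True):
    result = decompose_text(source, compat)
    print(compat, " ".join(f"{ord(c):04X}" for c in result))
```

Both carons survive in the compatibility result; nothing collapses repeated
marks.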
>
> >
> > Translation:
> >
> > [DZ with Caron] [Combining Caron]
> >
> > where:
> >
> > [DZ with Caron] -> [D] [Z with Caron]
> > &
> > [Z with Caron] -> [Z] [Combining Caron]
> >
> > should be decomposed (during regularization) to:
> >
> > a) [D] [Z] [Combining Caron] [Combining Caron]
> > b) [D] [Z] [Combining Caron]
> > c) [D] [Combining Caron] [Z] [Combining Caron]
> > d) [D] [Z]
> > e) something else... what?
> >
> > -------------------------------------------------------------------
> >
> > 2. Decomposable character (that decomposes to multiple other
> > characters of the same kind, e.g. 0xFDFA - "ARABIC LIGATURE SALLALLAHOU
> > ALAYHE WASALLAM" - it decomposes to three words!) is followed by another
> > combining/mark character(s).
>
> Follow the same rules as above. The application of combining marks is
> cumulative, and the decompositions are the result of simple recursive applications
> of the decomposition mappings.
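
As a hedged illustration of "cumulative", the sketch below assumes Python's
standard unicodedata module and picks U+064E ARABIC FATHA purely as an
example mark (the question does not name one). It checks that the trailing
mark is carried along unchanged after the expanded ligature:

```python
import unicodedata

ligature = "\uFDFA"  # ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
mark = "\u064E"      # ARABIC FATHA, chosen only as an example mark

expanded = unicodedata.normalize("NFKD", ligature)
expanded_with_mark = unicodedata.normalize("NFKD", ligature + mark)

# The ligature expands to many characters; the mark stays at the end.
print(len(expanded) > 1)
print(expanded_with_mark == expanded + mark)

# Canonical decomposition leaves the sequence as it was.
print(unicodedata.normalize("NFD", ligature + mark) == ligature + mark)
```
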
>
> >
> > -------------------------------------------------------------------
> >
> > 3. Surrogate characters - are they (some of them) decomposable? Can they
> > be regularized? Where can we find tables?
>
> There are no characters currently encoded in Unicode 2.1 or in Unicode 3.0
> via surrogates. When such characters are encoded in Unicode 4.0 and later,
> their decompositions will be included in the Unicode Character Database,
> along with all the other assigned characters.
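
For readers with a later Unicode Character Database than the versions
discussed here, a minimal sketch assuming Python 3 and its standard
unicodedata module. U+1D15E is used only as an example of a character
beyond U+FFFF whose decomposition is listed in the database; it was encoded
after this message was written:

```python
import unicodedata

half_note = "\U0001D15E"  # example character outside the 16-bit range

print(unicodedata.name(half_note))
print(unicodedata.decomposition(half_note))

# NFD uses the same table lookup as for any other assigned character.
decomposed = unicodedata.normalize("NFD", half_note)
print([f"U+{ord(c):04X}" for c in decomposed])
```
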
>
> >
> > -------------------------------------------------------------------
> >
> > 4. Are there such regularization/equivalence tables for CJKV Ideographs?
> > If yes, where? Are there equivalent CJKV Ideographs - some books say that
> > there are! They mention that there are even 20 different characters for the
> > same meaning and pronunciation! Is this true?
>
> The correct behavior for the first level of normalization, as you are
> talking about, is to simply make use of the equivalences in the Unicode
> Character Database to deal with the 302 compatibility CJK characters in
> the standard.
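
A small sketch of that first level, assuming Python's standard unicodedata
module. U+F900 is picked only as an example of a CJK compatibility
ideograph; its equivalence in the database is a canonical one, so NFD is
enough to map it to the unified ideograph:

```python
import unicodedata

compat_ideograph = "\uF900"  # CJK COMPATIBILITY IDEOGRAPH-F900

unified = unicodedata.normalize("NFD", compat_ideograph)
print(f"U+{ord(compat_ideograph):04X} -> U+{ord(unified):04X}")  # U+F900 -> U+8C48
```

This covers only equivalences recorded in the Unicode Character Database;
simplified/traditional mappings need the additional tables mentioned next.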
>
> Beyond that, you will need to acquire additional tables if you want to
> start making equivalences between simplified and traditional forms, for
> example. Much of that data can be extracted from the Unihan.txt data
> table available on the website.
>
> And rather than just being uncertain what the situation is, you need to
> make sure you acquire some in-house or consulting expertise on CJKV
> ideographs. I suggest you start by making sure you have a copy of Ken
> Lunde's book, CJKV Information Processing.
>
> >
> > -------------------------------------------------------------------
> >
> > 5. Even though Unicode 2.1 and Unicode 3.0 do not use surrogates, will they
> > allocate the space used for surrogates for something else or is this space
> > reserved for this purpose?
>
> Surrogate space is eternally reserved for nothing but use of surrogate code points
> to encode characters in the range U-00010000 .. U-0010FFFF in the future.
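
The mapping between that range and surrogate pairs is plain arithmetic. A
minimal sketch in Python (the function names are illustrative, not from the
standard):

```python
def to_surrogates(code_point: int) -> tuple:
    # Split a code point in U+10000..U+10FFFF into (high, low) surrogates.
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the surrogate-encoded range")
    offset = code_point - 0x10000            # 20 significant bits
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low

def from_surrogates(high: int, low: int) -> int:
    # Inverse mapping: recombine a surrogate pair into one code point.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

for cp in (0x10000, 0x10FFFF):
    high, low = to_surrogates(cp)
    print(f"U+{cp:06X} -> {high:04X} {low:04X}")
```
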
>
> >
> > -------------------------------------------------------------------
> >
> > 6. Should surrogate characters be encoded in UTF8 (for web use in
> > Netscape 4+ and IE4+) using two UTF8 sequences (one for each part/half
> > of the surrogate character, which results in six bytes) or using an
> > extended 4-byte or 5-byte UTF8 encoding of some UCS4 space? How
> > are surrogates mapped in UCS4? Is UCS4 == Unicode for non-surrogate characters?
>
> All of this is explained in detail in the Standard. See pp. A-7 ff of
> the Unicode Standard, Version 2.0. The answer is that you *must* use
> the 4-byte form of UTF-8 to represent a surrogate pair.
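
A hedged sketch of the difference, assuming Python 3 (whose codec behaviour
is an assumption of this example, not something the standard prescribes).
The "surrogatepass" error handler is used here only to construct the
six-byte form for comparison:

```python
def hex_bytes(data: bytes) -> str:
    # Illustrative helper: show bytes as space-separated hex.
    return " ".join(f"{b:02x}" for b in data)

character = "\U00010000"  # first code point represented by a surrogate pair

four_byte = character.encode("utf-8")
print(hex_bytes(four_byte))  # f0 90 80 80

# Encoding each surrogate half separately gives six bytes, which a strict
# UTF-8 decoder rejects.
six_byte = "\ud800\udc00".encode("utf-8", "surrogatepass")
print(hex_bytes(six_byte))   # ed a0 80 ed b0 80

try:
    six_byte.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)
```
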
>
> --Ken Whistler


