Re: no more precomposed characters for 1:1 conversion

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Tue Dec 02 2003 - 12:18:35 EST

    Peter Jacobi wrote:
    > Markus Scherer <markus.scherer@jtcsv.com> wrote:
    >
    >>ICU 2.8 has the ability to handle m:n character conversion mappings driven
    >>by simple lines in Unicode conversion tables (text files).
    >
    > That's a nice coincidence, to have this feature. I was wondering
    > if this would enable transcoding from legacy Tamil charsets (in visual
    > glyph order, like Thai) to Unicode.

    Possible, but this is "just" m:n character conversion. This feature does not add arbitrary text
    reordering. If you can achieve what you need with a set of m:n mappings, then you can use it by itself.

    Otherwise you would have to do line/paragraph chunking and use, for example, the ICU Transliterator
    classes for arbitrary Unicode-to-Unicode transforms after converting to or before converting out of
    Unicode.
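
    As a rough illustration of such a post-conversion step (plain Python, not the ICU Transliterator
    API), the sketch below moves a Tamil vowel sign that was typed before its consonant, in visual
    order, to its logical position after the consonant. The function name and the line-by-line
    chunking are assumptions for illustration; it handles only the single-part prefix signs
    U+0BC6..U+0BC8, and two-part vowels would need additional rules.

```python
import re

# Prefix vowel signs E, EE, AI (U+0BC6..U+0BC8) followed by a consonant
# (U+0B95 KA .. U+0BB9 HA). Visual order puts the sign first; Unicode
# logical order puts it after the consonant.
_VISUAL_PREFIX = re.compile("([\u0BC6-\u0BC8])([\u0B95-\u0BB9])")


def visual_to_logical(text: str) -> str:
    """Reorder prefix vowel signs, one line at a time (hypothetical helper)."""
    lines = text.split("\n")
    # Swap "sign + consonant" into "consonant + sign" within each line.
    return "\n".join(_VISUAL_PREFIX.sub(r"\2\1", line) for line in lines)
```

    A transform like this would run after converting the legacy bytes to Unicode, or its inverse
    before converting out of Unicode.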

    > I've looked at the example data files for the m:n mappings, but
    > it's still opaque to me what has to go in the headers. Is there a
    > point to start reading from to gain further insights?

    There will be by the time ICU 2.8 is released, and it will be in the User Guide. Sorry for not
    having written that yet.

    However, there is actually nothing you need to do in the header. The makeconv tool will detect
    mappings that have multiple code points and/or multiple complete codepage character byte sequences
    and automatically put them into an appropriate data structure. This is possible because it
    knows the structure of the codepage from the header information that is required anyway. (The
    structure of Unicode is known anyway, and trivial in .ucm files where code points are listed.)

    > I'm especially wondering, whether the converter by default will
    > take the longest matching entry in an m:n table or whether
    > the sequence of entries is significant. (Something must be done
    > to e.g. disambiguate keLa from kau).

    The sequence of entries is not significant. makeconv will sort the mappings internally for
    processing before the binary table is written.

    The converter must and will use the longest match - otherwise it would not be able to handle Ka vs.
    Ka+semi-voiced-mark in the Japanese table.
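
    To make the longest-match rule concrete, here is a minimal Python sketch. It is not ICU's
    implementation; the function name, the dict-based table, and the byte values are all made up for
    illustration. At each position it tries the longest candidate sequence first, so the order in
    which the mappings are listed does not matter.

```python
def encode_longest_match(text: str, table: dict) -> bytes:
    """Encode text using an m:n table, always preferring the longest match.

    `table` maps Unicode strings (one or more code points) to byte
    sequences. Both the table layout and the byte values used with it
    are hypothetical.
    """
    max_len = max(len(key) for key in table)
    out = bytearray()
    i = 0
    while i < len(text):
        # Try the longest possible sequence first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                out += table[chunk]
                i += n
                break
        else:
            raise ValueError(f"no mapping for U+{ord(text[i]):04X} at index {i}")
    return bytes(out)


# Hypothetical entries: KA alone vs. KA + combining semi-voiced sound mark.
EXAMPLE_TABLE = {
    "\u304B": b"\x10",
    "\u304B\u309A": b"\x20",
}
```

    With this table, encode_longest_match("\u304B\u309A", EXAMPLE_TABLE) yields the single byte
    for the combined sequence instead of stopping at the shorter KA entry.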

    For more contrived examples, see the test files test3.ucm and test4.ucm in icu/source/test/testdata/

    Best regards,
    markus


