Re: Compression through normalization

From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Dec 05 2003 - 17:50:00 EST


    On 05/12/2003 14:01, Philippe Verdy wrote:

    > ...
    >
    >It's just a shame that what was considered equivalent in the Korean
    >standards is considered canonically distinct (and even compatibility
    >distinct) in Unicode. This means that the same exact abstract Korean text
    >can have two distinct representations in Unicode and there's no way to match
    >these Unicode representations together. And also that when mapping Korean
    >charsets to Unicode, care must be taken, before making the mapping, that
    >compound jamos are used wherever possible.
    >
    >
    Agreed.
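
    To make the point concrete, here is a minimal Python sketch using only
    the standard unicodedata module. The choice of KIYEOK as the example is
    mine, not Philippe's: it checks that two simple jamos (U+1100 U+1100) and
    the compound jamo SSANGKIYEOK (U+1101) stay distinct under all four
    normalization forms.

```python
import unicodedata

# Two simple leading consonants (KIYEOK + KIYEOK) versus the single
# compound jamo SSANGKIYEOK. The pair is chosen only for illustration.
SIMPLE_SEQUENCE = "\u1100\u1100"
COMPOUND_JAMO = "\u1101"

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalized_forms_match(a: str, b: str) -> dict:
    """Return, per normalization form, whether a and b normalize to the same string."""
    return {
        form: unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in FORMS
    }


if __name__ == "__main__":
    for form, same in normalized_forms_match(SIMPLE_SEQUENCE, COMPOUND_JAMO).items():
        print(form, "equal" if same else "distinct")
```

    No normalization form brings the two spellings together, which is the
    mismatch described above.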

    >If the text is now stored and handled entirely in Unicode without returning
    >to the KSC standard, you won't have any other tool than just UCA to collate
    >strings (but collation does not produce strings, just collation weights,
    >and there's currently no tool to reverse a list of weights back to a
    >Unicode string...)
    >
    >...
    >
    I note the following, which is part of the text explaining C10:

    > All processes and higher-level protocols are required to abide by C10
    > as a minimum. However, higher-level protocols may define additional
    > equivalences that do not constitute modifications under that protocol.
    > For example, a higher-level protocol may allow a sequence of spaces to
    > be replaced by a single space.

    Presumably a higher level protocol could transform Korean text into a
    standardised form, doing what (in your opinion and mine at least)
    Unicode normalisation ought to have done.

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

