From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Dec 12 2003 - 07:56:43 EST
On 12/12/2003 04:29, Philippe Verdy wrote:
> ...
>
>But what you suggest here is exactly what a standard file compressor does.
>
>It does not solve any problem in the representation of characters: the
>compression scheme remains private, and the result can only be interpreted
>as text by redecomposing these PUAs (within their scope) into the
>appropriate complex DGCs. In addition, you need a way to store the
>associations between PUAs and DGCs, so the complexity is even worse.
>
>You would probably use it only if there are multiple occurrences of these
>complex DGCs, just to save some space. (This is what is done for Hangul
>Johab syllables, which occur very frequently in modern Korean text; the
>space benefit comes from the fact that the associations between syllables
>and their jamo DGCs need not be encoded, as they are defined by canonical
>equivalence and implemented with a very basic algorithm.)
>
>So unless you can devise a similarly simple algorithm to map complex DGCs
>to PUA ranges, there is little use in what you propose here.
>
>
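[The "very basic algorithm" mentioned above is plain arithmetic, defined in the Unicode Standard under Conjoining Jamo Behavior; no association table is needed. A minimal sketch in Python:]

```python
# Canonical composition of Hangul conjoining jamo into a precomposed
# Johab syllable, using the standard constants from the Unicode Standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def compose_hangul(l, v, t=None):
    """Compose a leading (l), vowel (v) and optional trailing (t) jamo."""
    l_index = ord(l) - L_BASE
    v_index = ord(v) - V_BASE
    t_index = (ord(t) - T_BASE) if t else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

# U+1112 + U+1161 + U+11AB compose to U+D55C (HAN)
print(compose_hangul('\u1112', '\u1161', '\u11AB'))
```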
This is not intended as a file compression technique. (Indeed it would
be an extremely poor one, as it is based on UTF-32!) It is intended only
to solve the problem Mark mentioned: that indexing etc. of strings is
inefficient when the string is counted and divided according to grapheme
clusters, following the recommendations for editing in UAX #29. The
mechanism I proposed was intended to allow a string of grapheme clusters
to be indexed efficiently, and nothing else - although, as you point out,
it might also help with rendering (though not necessarily, since the same
grapheme cluster is not always rendered the same way, e.g. in Arabic).
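A minimal sketch of the mechanism I have in mind, in Python. The PUA range chosen and the crude cluster segmentation are illustrative assumptions only - a real implementation would use proper UAX #29 grapheme cluster boundaries:

```python
import unicodedata

PUA_START = 0xF0000  # assumed private-use range (plane 15)

class ClusterIndexedString:
    """Map each multi-codepoint grapheme cluster to one PUA code point,
    so that counting and indexing by cluster become plain array operations.
    Cluster boundaries are approximated here by attaching combining marks
    to the preceding character (not full UAX #29 segmentation)."""

    def __init__(self, text):
        clusters = []
        for ch in text:
            if clusters and unicodedata.combining(ch):
                clusters[-1] += ch
            else:
                clusters.append(ch)
        self._table = {}     # PUA code point -> original cluster
        self._encoded = []   # one "character" per grapheme cluster
        for c in clusters:
            if len(c) == 1:
                self._encoded.append(c)
            else:
                pua = chr(PUA_START + len(self._table))
                self._table[pua] = c
                self._encoded.append(pua)

    def __len__(self):       # length in grapheme clusters, O(1)
        return len(self._encoded)

    def __getitem__(self, i):  # i-th grapheme cluster, O(1)
        cu = self._encoded[i]
        return self._table.get(cu, cu)

s = ClusterIndexedString("e\u0301tude")  # 'étude' with combining acute
# len(s) == 5 grapheme clusters; s[0] == 'e\u0301'
```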
--
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/
This archive was generated by hypermail 2.1.5 : Fri Dec 12 2003 - 08:36:43 EST