From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Dec 12 2003 - 06:44:51 EST
On 11/12/2003 17:55, Philippe Verdy wrote:
>Peter Kirk wrote:
>
>
>>I am sure that some tricks could be found to
>>simplify the indexing if necessary, e.g. using PUA or non-character code
>>points indexed into a special table to replace DGCs which cannot be
>>represented as a single character. (There are plenty of non-characters
>>available as you need to use UTF-32 here to avoid exactly the same
>>problems with surrogates.)
>>
>>
>
>You're quite optimistic here: the total number of DGCs that can be encoded
>in Unicode goes far beyond the capacity of PUAs and even of the whole
>Unicode range itself.
>
>I did not try to count them for the simplest cases, but possible DGCs are
>nearly infinite:
>- there's no upper limit for the number of diacritics you can combine with a
>base character
>- there's no limit on the number of base characters that can be used to
>build Hangul syllables.
>
>
More than that, actually infinite, as any one diacritic may be repeated.
>So how will you allocate PUAs? Using an internal lookup table, stored with
>the document that uses these PUAs, that translates only the DGCs used
>internally into single PUAs? ...
>
Well, I wasn't actually thinking of storing these with the document,
although I suppose they could be if I were to choose the approach, which
I don't like, of storing documents in a private format. (This wouldn't
even be an efficient format if I am mostly using UTF-32.) I was thinking
rather of translating complex DGCs into PUAs etc. on input of each
document individually, and keeping in memory a table mapping these PUAs
to character strings. Actually it is probably better in this case to use
non-characters, as there may be PUAs in the document already, and this
avoids some of the problems you noted. As a 32-bit code unit gives me
65519 whole planes of non-characters (values beyond plane 16, i.e. above
U+10FFFF), which can support more than 4 billion distinct DGCs, I think
I will have enough space for any practical document.
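
A minimal sketch of such an in-memory table, in Python, under several
assumptions that are not part of the suggestion itself: the names
ClusterTable, split_clusters, FIRST_SPARE and SPARE_PLANES are invented
for illustration; a "DGC" is approximated as a base character followed by
combining marks (general category M*), which is much cruder than real
grapheme cluster segmentation; the text is normalized to NFC first so
that clusters with a precomposed form stay single characters; and the
"UTF-32" buffer is modelled as a list of integers, since a Python string
cannot hold values above U+10FFFF.

```python
import unicodedata

FIRST_SPARE = 0x110000                # first 32-bit value above U+10FFFF
SPARE_PLANES = (2**32 >> 16) - 17     # 65536 planes minus the 17 Unicode planes


def split_clusters(text):
    """Crude cluster split: each combining mark joins the preceding character."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


class ClusterTable:
    """Table for one open document: multi-code-point cluster <-> spare value."""

    def __init__(self):
        self.to_code = {}      # cluster string -> spare 32-bit value
        self.to_cluster = []   # index (value - FIRST_SPARE) -> cluster string

    def encode(self, text):
        """Return one 32-bit integer per cluster of the NFC-normalized text."""
        units = []
        for cluster in split_clusters(unicodedata.normalize("NFC", text)):
            if len(cluster) == 1:
                units.append(ord(cluster))
                continue
            code = self.to_code.get(cluster)
            if code is None:
                # Allocate the next unused value above the Unicode range.
                code = FIRST_SPARE + len(self.to_cluster)
                self.to_code[cluster] = code
                self.to_cluster.append(cluster)
            units.append(code)
        return units

    def decode(self, units):
        """Expand spare values back into their cluster strings."""
        return "".join(
            chr(u) if u < FIRST_SPARE else self.to_cluster[u - FIRST_SPARE]
            for u in units
        )
```

With this sketch, every cluster occupies exactly one slot in the integer
buffer, so indexing by cluster is plain array indexing. The table is only
meaningful together with the buffer it was built for, and the spare
values must be expanded again with decode() before the text leaves the
application.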
>... Now how will you implement indexing with these
>private PUAs, whose semantics change across documents? What is the
>relevant scope for these PUAs?
>
>
The scope would be one instance of a document opened in an application.
As for implementation details, that is for implementers to sort out.
This was a tentative suggestion which I made in passing, not something
which I had thought through in detail.
In the 19th century Charles Babbage wrote, concerning his prototype
computers:
> Propose to an Englishman any principle, or any instrument, however
> admirable, and you will observe that the whole effort of the English
> mind is directed to find a difficulty, a defect, or an impossibility
> in it.
I regret that we English may have exported this unfortunate trait.
-- Peter Kirk peter@qaya.org (personal) peterkirk@qaya.org (work) http://www.qaya.org/