Re: NFC/NFKC Normalization Edge Case

From: Jeff Senn (senn@maya.com)
Date: Tue Sep 29 2009 - 11:45:19 CDT

Next message: announcements@unicode.org: "[Unicode Announcement] Unicode Haiku Contest"

Previous message: Roozbeh Pournader: "Re: Run-time checking of fonts for Sinhala support"
Maybe in reply to: Jeff Senn: "NFC/NFKC Normalization Edge Case"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

Unless I'm mistaken your analysis misses the case where there can
be repeated combiners (or even multiple combiners of the same
class)...e.g.

NFC(0042 0300 0300 0300 0300 0323) --> 1E04 0300 0300 0300 0300

I guess there is no theoretical limit, hence the "need" for the
"Stream-Safe Text Format".
(see UAX 15 ch20)

On Sep 29, 2009, at 12:02 PM, verdy_p wrote:

> As long as you have not reached the end of stream or reached another
> character that has a lower ccc and that CAN be
> combined or be reordered with the existing characters in the buffer
> (up to and including the last starter), you
> can't presume that your sequence is terminated and you can't flush it.
>
> Fortunately, due to the limited number of possible ccc values that
> exist in the database (and given that your look-
> ahead buffer will never contain several characters with identical
> ccc values, so that your buffer will only contain
> characters sorted in ascending order of ccc values), this
> effectively limits the needed size for your buffer. In
> addition, this limit is much lower, given that you can easily
> predict what is the maximum length of existing
> sequences after which a further combining character could be
> combined with the first character of your buffer, or
> inserted between it and the other combining characters in your
> buffer :
>
> The longest sequences to keep in the buffer would be those starting
> with a starter S (with ccc=0), with one
> combining character in each of the existing combining classes (in
> ascending ccc), pending for a character with the
> highest ccc value possible that would be the last one.
>
> This length is necessarily limited, given that the maximum ccc value
> possible is 255 (if such combining character is
> defined, and gets encoded in a canonically combining pair, and if
> all the 256 values of ccc were allocated, you
> would need a buffer for up to 255 characters maximum (the first
> location being used by the starter, all the other
> 254 locations being used by combining characters of each class, and
> finally, when reading the last character with
> ccc=255 that could combine with the first one.
>
> Note however that in almost all cases, the used length in the 255-
> codepoints buffer will be very small, so adding a
> new character in the buffer should really use a simple insertion
> sort, or could even use a bubble sort :
>
> (1) first compare the ccc value of the new character with the
> existing ccc value of characters in the buffer in slot
> [0..N-1] (keep a variable for it) : if it is equal, then the new
> character is not blocked, and you can output the
> content of your buffer (you would have to sort it first by ascending
> ccc value, if you don't use the insertion
> exposed below for step 3), keeping just the new character in slot[0].
>
> (2) first compare the new character with the one in used slot[0] to
> see if it matches a canonical pair: if it
> matches, replace the character in slot[0] by the precombined
> character.
>
> (3) otherwise just place the new character in slot[N] (where N is
> the used length) and increment N.
>
> In this last step (3), you may (and should) also want to use a
> sliding insertion in backward direction (instead of
> just placing the new character at end of the buffer in slot[N]),
> creating the equivalent of an insertion sort, to
> avoid keeping the maximum ccc value within the N characters stored
> in the buffer : it will even be more efficient
> than sorting the buffer before outputing it (in step 1), because the
> buffer length will (almost always) be extremely
> small with length 0 or 1, in more than 99% of cases (the additional
> cost of "efficient" sorts like Quicksort is not
> worth the value on very small vectors that will constitute the huge
> majority of cases, even with the longest
> theoretical ones that are MUCH smaller in length than 255, with the
> existing UCD): insertion sorts are very
> efficient on very small vectors.
>
> If you do this, then the character in slot[0] (if there's one, i.e.
> when N>0) will be kept in the buffer only :
> * if it can effectively combine (when it has ccc=0) within an
> existing canonical pair of the UCD,
> * or if there are other characters in the UCD with a ccc value
> distinct from the ccc values of non-starter
> characters already in your buffer in slot[1 .. N-1].
>
> You can also perform the backward sliding for insertion in the same
> loop used to scan the content of your buffer for
> comparing ccc values: when you find the location of the insertion,
> you can immediately place the character at that
> location, given that the rest of the buffer is already slid:
>
> When this is done, the slot[i] of insertion is the one where you
> just store the codepoint, so every other prior
> stots in [0..i-1] can be output. Theoretically, you would want to
> slide back the rest of the buffer so that slot[i]
> will go back to slot[0] after you have output every character in
> slot[0 .. i-1], but the simplest is to not slide
> this content at all, but keep a persistant index of the location
> with the first used character.
>
>> Message du 23/09/09 16:04
>> De : "Jeff Senn"
>> A : unicode@unicode.org
>> Copie à :
>> Objet : Re: NFC/NFKC Normalization Edge Case
>>
>>
>>
>> On Sep 22, 2009, at 5:44 PM, Kenneth Whistler wrote:
>>>>
>>>> All of these characters have combining class 0. Can they be
>>>> canonically
>>>> combined? Even though the 2nd characters are NOT "combining"?
>>>
>>> There's the first mistake. Both of the 2nd characters in
>>> these sequences *ARE* combining:
>>
>> Thanks guys. Clear now.
>>
>> I was led astray by:
>>
>> "D1. A character S is a starter if it has a combining class of zero
>> in
>> the Unicode Character Database. Any other character is a non-
>> starter."
>>
>> and
>>
>> "In some implementations, people may be working with streaming
>> interfaces that read and write small amounts at a time. In those
>> implementations, the text back to the last starter needs to be
>> buffered. Whenever a second starter would be added to that buffer,
>> the
>> buffer can be flushed."
>>
>> Which are technically correct (once you know the correct answer),
>> but led me to suspect I could do slightly more aggressive flushing of
>> a streamed buffer...
>>
>>
>>
>>
>>
>>
>>
>>
>
>
>

Next message: announcements@unicode.org: "[Unicode Announcement] Unicode Haiku Contest"
Previous message: Roozbeh Pournader: "Re: Run-time checking of fonts for Sinhala support"
Maybe in reply to: Jeff Senn: "NFC/NFKC Normalization Edge Case"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Tue Sep 29 2009 - 11:50:41 CDT