Re: Unicode String Models

From: Mark Davis ☕️ via Unicode <>
Date: Tue, 2 Oct 2018 14:03:38 +0200


On Sun, Sep 9, 2018 at 10:03 AM Richard Wordingham via Unicode <> wrote:

> On Sat, 8 Sep 2018 18:36:00 +0200
> Mark Davis ☕️ via Unicode <> wrote:
> > I recently did some extensive revisions of a paper on Unicode string
> > models (APIs). Comments are welcome.
> >
> >
> Theoretically at least, the cost of indexing a big string by codepoint
> is negligible. For example, cost of accessing the middle character is
> O(1)*, not O(n), where n is the length of the string. The trick is to
> use a proportionately small amount of memory to store and maintain a
> partial conversion table from character index to byte index. For
> example, Emacs claims to offer O(1) access to a UTF-8 buffer by
> character number, and I can't significantly fault the claim.
> *There may be some creep, but it doesn't matter for strings that can be
> stored within a galaxy.
> Of course, the coefficients implied by big-oh notation also matter.
> For example, it can be very easy to forget that a bubble sort is often
> the quickest sorting algorithm.

Thanks, added a quote from you on that; see if it looks ok.
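For readers following along, the scheme Richard describes (a sparse table
mapping character index to byte index) might look like this minimal Python
sketch. It is not Emacs's actual implementation; STRIDE, build_index, and
byte_offset are names invented here for illustration:

```python
STRIDE = 32  # checkpoint spacing: smaller = more memory, less scanning

def build_index(data: bytes) -> list[int]:
    """Record the byte offset of every STRIDE-th code point in UTF-8 data."""
    offsets, cp = [], 0
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:        # lead byte, i.e. start of a code point
            if cp % STRIDE == 0:
                offsets.append(i)
            cp += 1
    return offsets

def byte_offset(data: bytes, index: list[int], cp_index: int) -> int:
    """Byte offset of code point cp_index: O(1) jump plus a bounded scan."""
    pos = index[cp_index // STRIDE]
    for _ in range(cp_index % STRIDE):
        pos += 1
        while pos < len(data) and data[pos] & 0xC0 == 0x80:
            pos += 1                # skip continuation bytes
    return pos
```

The scan from the nearest checkpoint is bounded by STRIDE, so lookup cost
stays constant regardless of string length, at the price of one stored
offset per STRIDE code points.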

> You keep muttering that a sequence of 8-bit code units can contain
> invalid sequences, but often forget that that is also true of sequences
> of 16-bit code units. Do emoji now ensure that confusion between
> codepoints and code units rapidly comes to light?

I didn't neglect that, had a [TBD] for it.

While invalid unpaired surrogates in UTF-16 don't complicate processing much
if they are treated as unassigned characters, allowing invalid UTF-8
sequences is more troublesome. See, for example, the convolutions needed in
ICU methods that accept ill-formed UTF-8.
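To make the contrast concrete, here is a sketch (the function name is mine,
not ICU's) of UTF-16 iteration that tolerates lone surrogates: a single
extra check decides whether a lead surrogate has a valid trail, and
otherwise the lone unit is passed through as if it were unassigned:

```python
def next_cp_utf16(units: list[int], i: int) -> tuple[int, int]:
    """Return (code point, next index), passing lone surrogates through."""
    u = units[i]
    if 0xD800 <= u <= 0xDBFF and i + 1 < len(units) \
            and 0xDC00 <= units[i + 1] <= 0xDFFF:
        # Valid surrogate pair: combine into a supplementary code point.
        cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
        return cp, i + 2
    return u, i + 1  # BMP unit, or an unpaired surrogate treated as-is
```

For ill-formed UTF-8 there is no equally simple pass-through: the handler
must also decide how many of the offending bytes to consume and what to
substitute for them, which is where the convolutions come in.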

> You seem to keep forgetting that grapheme clusters are not how some
> people work. Does the English word 'café' contain the letter
> 'e'? Yes or no? I maintain that it does. I can't help thinking that
> one might want to look for the letter 'ă' in Vietnamese and find it
> whatever the associated tone mark is.

I'm pretty familiar with the situation, thanks for asking.

Often you want to find out more about the components of grapheme clusters,
so you always need to be able to iterate through the code points they
contain. One might think that iterating by grapheme cluster is hiding
features of the text. For example, with *fox́* (fox\u{301}) it is easy to
find that the text contains an *x* by iterating through code points. But
code points often don't reveal their components: does the word
*también* contain
the letter *e*? A reasonable question, but iterating by code point rather
than grapheme cluster doesn't help, since it is typically encoded as a
single U+00E9. And even decomposing to NFD doesn't always help, as with
cases like *rødgrød*.
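The *también*/*rødgrød* point is easy to verify with the standard Python
unicodedata module:

```python
import unicodedata

word = "tambi\u00e9n"            # "también", é precomposed as U+00E9
assert "e" not in word           # a code point scan finds no plain "e"

nfd = unicodedata.normalize("NFD", word)
assert "e" in nfd                # NFD splits é into "e" + U+0301

# NFD does not help for rødgrød: ø (U+00F8) has no canonical
# decomposition, so no plain "o" ever appears.
r = "r\u00f8dgr\u00f8d"
assert "o" not in unicodedata.normalize("NFD", r)
```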

> You didn't discuss substrings.

I did. But if you mean a definition of substring that lets you access
internal components of substrings, I'm afraid that is quite a specialized
usage. One could do it, but it would burden the general use case.

> I'm interested in how subsequences of
> strings are defined, as the concept of 'substring' isn't really Unicode
> compliant. Again, expressing 'ă' as a subsequence of the Vietnamese
> word 'nặng' ought to be possible, whether one is using NFD (easier) or
> NFC. (And there are alternative normalisations that are compatible
> with canonical equivalence.) I'm most interested in subsequences X of a
> word W where W is the same as AXB for some strings A and B.

> Richard.
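For what it's worth, the 'ă'-in-'nặng' subsequence test can be sketched
naively by normalizing both strings to NFD and checking for an in-order
match of code points. This is a toy under that one assumed definition: it
ignores grapheme boundaries, so it can match across clusters in ways a real
API would need to rule out:

```python
import unicodedata

def is_subsequence(needle: str, haystack: str) -> bool:
    """True if needle's NFD code points occur in order within haystack's NFD."""
    n = unicodedata.normalize("NFD", needle)
    h = unicodedata.normalize("NFD", haystack)
    it = iter(h)                     # "ch in it" consumes, enforcing order
    return all(ch in it for ch in n)
```

On NFD forms this finds 'ă' (a + U+0306) inside 'nặng', whose ặ decomposes
to a + U+0323 + U+0306, since canonical ordering puts the dot below before
the breve.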
Received on Tue Oct 02 2018 - 07:05:35 CDT
