Re: Unicode String Models from Mark Davis ☕️ via Unicode on 2018-10-02 (Unicode Mail List Archive)

From: Mark Davis ☕️ via Unicode <unicode_at_unicode.org>
Date: Tue, 2 Oct 2018 14:03:48 +0200

Mark

On Sun, Sep 9, 2018 at 3:42 PM Daniel Bünzli <daniel.buenzli_at_erratique.ch>
wrote:

> Hello,
>
> I find your notion of "model" and presentation a bit confusing since it
> conflates what I would call the internal representation and the API.
>
> The internal representation defines how the Unicode text is stored and
> should not really matter to the end user of the string data structure. The
> API defines how the Unicode text is accessed, expressed by what is the
> result of an indexing operation on the string. The latter is really what
> matters for the end-user and what I would call the "model".
>

Because of performance and storage considerations, you need to consider the
possible internal data structures when you are looking at something as
low-level as strings. But most of the "models" in the document are really
distinguished only by API; only the "Code Point Model" discussions are
segmented by internal storage, as with "Code Point Model: UTF-32".
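
As a rough illustration of that split between API and internal storage,
here is a minimal Python sketch. It is not taken from the document under
discussion; the helper name code_point_at and the sample text are
assumptions made up for this example. The same text is held in three
internal encodings, and a code-point-indexed accessor returns the same
values for each of them.

```python
def code_point_at(storage: bytes, encoding: str, index: int) -> int:
    """Return the code point at `index`, whatever the internal storage is.

    Hypothetical helper for illustration: it decodes the whole buffer on
    every call, so it is O(n) and not how a real string type would work.
    """
    return ord(storage.decode(encoding)[index])


# 'a', 'e with acute', and one supplementary-plane character (U+1F600).
text = "a\u00e9\U0001F600"

# Three internal representations of the same text.
storages = {enc: text.encode(enc) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# The storage cost in bytes differs per representation...
sizes = {enc: len(buf) for enc, buf in storages.items()}

# ...but the code-point API gives the same answers for all of them.
results = {
    enc: [code_point_at(buf, enc, i) for i in range(3)]
    for enc, buf in storages.items()
}

print(sizes)
print(results)
```

The storage choice shows up only in the byte counts (and, in a real
implementation, in the cost of indexing), which is why it still matters
for something as low-level as strings.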

> I think the presentation would benefit from making a clear distinction
> between the internal representation and the API; you could then easily
> summarize them in a table which would make a nice summary of the design
> space.
>

That's an interesting suggestion; I'll mull it over.

>
> I also think you are missing one API which is the one with ECG I would
> favour: indexing returns Unicode scalar values, internally be it whatever
> you wish UTF-{8,16,32} or a custom encoding. Maybe that's what you intended
> by the "Code Point Model: Internal 8/16/32" but that's not what it says,
> the distinction between code point and scalar value is an important one and
> I think it would be good to insist on it to clarify the minds in such
> documents.
>

In reality, most APIs are not even going to be in terms of code points:
they will return int32's. So not only are the returned values not
guaranteed to be scalar values, 99.97% of the possible int32 values are not
even code points. Of course, values above 10FFFF or below 0 shouldn't ever
be stored in strings, but in practice treating non-scalar-value code points
(that is, surrogates) as "permanently unassigned" characters doesn't really
cause problems in processing.
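
To make the arithmetic concrete, here is a small Python sketch. The
function names are hypothetical and chosen for this example only; the
ranges are the standard ones (code points 0..10FFFF, surrogates
D800..DFFF). It classifies an integer and computes the share of 32-bit
values that are not code points at all.

```python
MAX_CODE_POINT = 0x10FFFF
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF


def is_code_point(value: int) -> bool:
    """True if value lies in the Unicode code space 0..10FFFF."""
    return 0 <= value <= MAX_CODE_POINT


def is_scalar_value(value: int) -> bool:
    """True if value is a code point that is not a surrogate."""
    return is_code_point(value) and not (
        SURROGATE_FIRST <= value <= SURROGATE_LAST
    )


def classify(value: int) -> str:
    """Label an integer the way an int32-returning API might need to."""
    if is_scalar_value(value):
        return "scalar value"
    if is_code_point(value):
        return "surrogate code point"
    return "not a code point"


# Share of the 2**32 possible 32-bit values that are not code points.
non_code_point_share = 1 - (MAX_CODE_POINT + 1) / 2**32

print(classify(0x41), classify(0xD800), classify(0x110000))
print(f"{non_code_point_share:.2%}")
```

There are 1,114,112 code points out of 4,294,967,296 possible 32-bit
values, which is where the 99.97% figure comes from.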

> Best,
>
> Daniel
>
>
>
Received on Tue Oct 02 2018 - 07:05:59 CDT
