Re: Unicode String Models from martin_at_v.loewis.de on 2012-07-20 (Unicode Mail List Archive)

From: <martin_at_v.loewis.de>
Date: Sat, 21 Jul 2012 01:14:10 +0200

> "That means that it is best to optimize for BMP characters (and as a
> subset, ASCII and Latin-1), and fall into a ‘slow path’ when a
> supplementary character is encountered."
>
> I'm concerned about the statement/implication that one can optimize
> for ASCII and Latin-1. It's too easy for a lot of developers to test
> speed with the English/European documents they have around and test
> correctness only with Chinese.

I don't think this is a concern within the context of the posting.
He is talking about Unicode String Models, something that most developers
will never have to design themselves - instead, they use what the
language gives them.

People implementing Unicode support for programming languages, in
turn, will typically be aware of these issues.

> I can see for i in range (1, 1000) do a := " "; a +:= "龜"; done being
> way slower than necessary (especially for non-trivially optimized away
> cases), for example.

Why is that? Take Python 3.3, for example. It does optimize for ASCII,
so the first string uses only one byte for the space, and the second
uses two bytes for 龜 (both are string literals, each already stored in
a constant string object).

The concatenation determines that the result string will need two bytes
per char and will have two chars, so it allocates a string able to hold
four bytes of character data. It then copies the space (widening its
representation) and the other character (as-is). I don't see why this is
slower than necessary.
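
To make the widths concrete, here is a rough sketch that estimates the
per-character storage from the outside. It assumes CPython 3.3 or later
(the flexible string representation); the helper name bytes_per_char is
mine, and other implementations may store strings differently.

```python
import sys

def bytes_per_char(s, n=1000):
    """Estimate storage per code point of s repeated many times.

    Differencing two repetition counts cancels the fixed object header,
    leaving only the per-character cost (CPython-specific assumption).
    """
    small = sys.getsizeof(s * n)
    large = sys.getsizeof(s * (2 * n))
    return (large - small) // (len(s) * n)

space = " "               # ASCII: narrowest representation
turtle = "龜"             # U+9F9C, a BMP character outside Latin-1
joined = space + turtle   # the concatenation discussed above
supplementary = "\U00020000"  # a character outside the BMP

for label, s in [("space", space), ("turtle", turtle),
                 ("joined", joined), ("supplementary", supplementary)]:
    print(label, bytes_per_char(s))
```

On the CPython versions I am describing, I would expect 1, 2, 2 and 4:
the joined string is stored at the width of its widest character, which
is the widening copy mentioned above.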

> "Interfacing with most software libraries can avoid conversions in and out"
>
> I'm curious about this. I won't dismiss it off hand, but besides ICU,
> what libraries are we talking about that haven't already been
> rewritten for GTK, Java, Python, take your pick.

"rewritten for"? None. Besides perhaps XML parsers, I don't think many
libraries have been rewritten *for* Python, none for GTK, and many not
for Java. Take database adapters, for example. To access MySQL, Postgres,
Oracle, or SQLite, you often need to use the C library of the database
vendor, which then got integrated (e.g. through some FFI) into GTK,
Java, and Python. However, this FFI integration is where the conversions
in and out need to be performed.
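
As an illustration of that boundary, here is a small sketch using
Python's bundled sqlite3 module and an in-memory database. It assumes
SQLite's default UTF-8 database encoding; the variable names are only
for the example.

```python
import sqlite3

text = " 龜"

conn = sqlite3.connect(":memory:")
# CAST(? AS BLOB) exposes the bytes the C library holds for the text
# value; the second parameter is returned as an ordinary Python str.
blob, echoed = conn.execute(
    "SELECT CAST(? AS BLOB), ?", (text, text)
).fetchone()
conn.close()

print(type(blob).__name__, len(blob))  # raw bytes as seen by SQLite
print(echoed == text)                  # text survives the round trip
```

The str object is converted to UTF-8 on the way into the C library and
converted back on the way out, whatever its in-memory representation
happens to be.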

> "The question of whether to allow non-ASCII characters in variables is open."
>
> I don't see why.

Do you factually disagree that there is no universal consensus on this
question? Some languages support non-ASCII identifiers, but many more
don't, and proponents of those languages often claim that such support
isn't really needed. So I'd agree that the question is still undecided,
i.e. open.
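
For reference, Python 3 is one of the languages that does accept them.
A quick check, using only str.isidentifier() and a plain assignment
(the sample names are arbitrary):

```python
# Candidate names; the last two are invalid for reasons unrelated to
# ASCII (leading digit, embedded space).
names = ["size", "größe", "龜", "2fast", "a b"]
valid = {name: name.isidentifier() for name in names}

for name, ok in valid.items():
    print(repr(name), ok)

# A non-ASCII identifier used directly in source code.
龜 = 1
print(龜 + 1)
```

That one language permits this says nothing about whether others
should, which is the part that remains open.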

Regards,
Martin
Received on Fri Jul 20 2012 - 18:19:18 CDT
