Re: NNBSP

From: Asmus Freytag via Unicode <unicode_at_unicode.org>
Date: Fri, 18 Jan 2019 16:55:07 -0800
On 1/18/2019 2:05 PM, Marcel Schneider via Unicode wrote:
On 18/01/2019 20:09, Asmus Freytag via Unicode wrote:

Marcel,

As for your many detailed *technical* questions about the history of character properties, I am afraid I have no specific recollection.

Other list members, many of whom are aware of how things happened, are welcome to join in. My questions are meant to be rather simple. Summing up the main ones:
  1. Why does UTC ignore the need of a non-breakable thin space?
  2. Why did UTC not declare PUNCTUATION SPACE non-breakable?

A less important piece of information would be how extensively typewriters with proportional advance width were used to produce books ready for print.

Another question you do answer below:

French is not the only language that uses a space to group figures. In fact, I grew up with thousands separators being spaces, but in many existing publications and documents a full (ordinary) space was certainly used. Not surprisingly, because in those years documents were typewritten and even many books were simply reproduced from typescript.

When it comes to figures, there are two different types of spaces.

One is a space that has the same width as a digit and is used in the layout of lists. For example, if you have a leading currency symbol, you may want to have that lined up on the left and leave the digits representing the amounts "ragged". You would fill the intervening spaces with this "lining" space character and everything lines up.

That is exactly how I understood hot-metal typesetting of tables. What surprises me is that computerized layout works the same way instead of using tabulation and appropriate tab stops (left, right, centered, decimal [with all decimal separators lining up vertically]).

==> At the time Unicode was first created (and definitely before that, during the time of non-universal character sets) many applications existed that used a "typewriter model" and worked by space fill rather than decimal-point tabulation.

From today's perspective that older model is inflexible and not the best approach, but it is impossible to say how long this legacy approach hung on in some places and how much data might exist that relied on certain long-standing behaviors of these space characters.
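
To make the "lining" use concrete, here is a minimal Python sketch, assuming FIGURE SPACE (U+2007) renders exactly as wide as a digit (tabular figures); the currency symbol and amounts are made-up examples:

    # Sketch: pad amounts with FIGURE SPACE (U+2007) so a leading currency
    # symbol stays flush left while the digits line up on the right, the
    # way a space-filled ("typewriter model") table layout would do it.
    FIGURE_SPACE = "\u2007"  # as wide as a digit when figures are tabular

    amounts = ["1234567", "98700", "1250"]  # made-up integer amounts
    width = max(len(a) for a in amounts)
    for amount in amounts:
        # fill the gap between symbol and digits with lining spaces
        print("$" + FIGURE_SPACE * (width - len(amount)) + amount)

With a proportional font the alignment only holds because everything in the gap and in the amounts has digit width; a decimal tab stop achieves the same result without relying on that property.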

For a good solution, you always need to understand

(1) the requirement of your "index" case (French, in this case)

(2) how it relates to similar requirements in (all!) other languages / scripts

(3) how it relates to actual legacy practice

(3a) what will suddenly no longer work if you change the properties on some character

(3b) what older data will no longer work if the effective behavior of newer applications changes

In lists like that, you can get away with not using a narrow thousands separator, because the overall context of the list indicates which digits belong together and form a number. Having a narrow space may still look nicer, but complicates the space fill between the symbol and the digits.

It does not, provided that all numbers have thousands separators, even when filling with spaces. It looks nicer because it is more legible.

Now, for numbers in running text, using an ordinary space has multiple drawbacks. It is definitely less readable and, in digital representation, if you use U+0020 you don't communicate that this is part of a single number that is best not broken across lines.
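
As a small illustration (a sketch, not anything prescribed by the standard), the difference is visible at the code-point level; the helper below is hypothetical and simply groups digits by threes:

    # Sketch: insert a separator every three digits from the right.
    # With U+0020 a line breaker may split the number across lines;
    # with U+202F NARROW NO-BREAK SPACE it is kept together.
    NNBSP = "\u202f"

    def group_digits(digits: str, sep: str) -> str:
        groups = []
        while digits:
            groups.append(digits[-3:])
            digits = digits[:-3]
        return sep.join(reversed(groups))

    print(group_digits("1234567", " "))    # '1 234 567'  (breakable)
    print(group_digits("1234567", NNBSP))  # '1\u202f234\u202f567'  (non-breaking)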

Right.

The problem Unicode had is that it did not properly understand which of the two types of "numeric" spaces was represented by "figure space". (I remember that we had discussions on that during the early years, but they were not really resolved and we moved on to other issues, many of which were demanding attention.)

You were discussing whether the thousands separator should have the width of a digit or the width of a period? Consistent with many other choices, the solution would have been to encode them both as non-breakable, especially as both were at hand, leaving the choice to the end user.

==> Right, but remember, we started off encoding a set of spaces that existed before Unicode (in some other character sets) and implicitly made the assumption that those were the correct set (just as we took punctuation from ASCII and similar sources and only added to it later, when we understood that things were missing; we generally added, and generally did not redefine the behavior or shape of, existing code points).


Current practice in electronic publishing was to use a non-breakable thin space, Philippe Verdy reports. Did that information come into play somehow?

==> probably not in the early days.


ISO 31-0 was published in 1992, perhaps too late for Unicode. It is normally understood that the thousands separator should not have the width of a digit. The alleged reason is security. Though on a typewriter, as you state, there is scarcely any other option. By that time, all computerized text was fixed width, Philippe Verdy reports. On screen, I figure, not in book print.

==> much book printing was also done by photomechanically reproducing typescript at that time. Not everybody wanted to pay typesetters and digital typesetting wasn't as advanced. I actually did use a digital phototypesetter of the period a few years before I joined Unicode, so I know. It was more powerful than a typewriter, but not as powerful as TeX or later the Adobe products.

For one, you didn't typeset a page, only a column of text, and it required manual paste-up etc.

If you want to do the right thing you need:

(1) have a solution that works as intended for ALL languages using some form of blank as a thousands separator - solving only the French issue is not enough. We should not do this one language at a time.

That is how CLDR works.

CLDR data is by definition per-language. Except for inheritance, languages are independent.

There are no "French" characters. When you encode characters, at best some code points may be script-specific. For punctuation and spaces, not even that may be the case. Therefore, as long as you try to solve this as if it were only a French problem, you are not doing proper character encoding.
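
As an aside, the per-locale nature of CLDR is easy to see with a CLDR-backed formatter such as Babel (a sketch; which space character shows up as the French group separator depends on the CLDR version the library bundles, U+00A0 in older data, U+202F in newer):

    # Sketch: the grouping separator is locale data, while the code points
    # themselves (U+0020, U+00A0, U+202F, ...) are not tied to any language.
    from babel.numbers import format_decimal

    for loc in ("fr_FR", "de_DE", "en_US"):
        s = format_decimal(1234567.89, locale=loc)
        print(loc, s, [hex(ord(c)) for c in s if not c.isalnum()])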



Do you have colleagues in Germany and other countries who can confirm whether their practice matches the French usage in all details, or whether there are differences? (Including differing acceptability of fallback renderings...).

No, I don't, but people may wish to read the German Wikipedia:

https://de.wikipedia.org/wiki/Zifferngruppierung#Mit_dem_Tausendertrennzeichen

Shared in ticket #11423:
https://unicode.org/cldr/trac/ticket/11423#comment:15


==> for your proposal to be effective, you need to reach out.


(2) have a solution that works for lining figures as well as separators.

(3) have a solution that understands ALL uses of spaces that are narrower than the normal space. Once a character exists in Unicode, people will use it on the basis of "closest fit" to make it do (approximately) what they want. Your proposal needs to address any issues that would be caused by reinterpreting a character more narrowly than it has been used. Only by comprehensively identifying ALL uses of comparable spaces in various languages and scripts can you hope to develop a solution that doesn't simply break all non-French text in favor of supporting French typography.

There is no such problem, except that NNBSP has never worked properly in Mongolian. It was an encoding error, and that is the reason why, to date, all font developers unanimously request the Mongolian Suffix Connector. That leaves NNBSP to what it is consistently used for outside Mongolian: a non-breakable thin space, a kind of belated avatar of what PUNCTUATION SPACE should have been from the beginning.

==> I mentioned before that if something is universally "broken" it can sometimes be resurrected, because even if you change its behavior retroactively, you will not break anything that ever worked correctly. (But you need to be sure that nobody repurposed the NNBSP for something useful that is different from what you intend to use it for, otherwise you can't change anything about it.)

If, however, you are merely adding a use for some existing character that does not affect its properties, that is usually not as much of a problem - as long as we can have some confidence that both usages will continue to be possible.

Perhaps you see why this issue has languished for so long: getting it right is not a simple matter.

Still, it is as simple as not skipping PUNCTUATION SPACE when FIGURE SPACE was made non-breakable. Now we have ended up with a mutated Mongolian space that does not work properly for Mongolian, but does for French and other languages using the Latin script. It would work even better if TUS were blunter, urging all foundries to update their whole catalogues soon.

==> You realize that I'm giving you general advice here, not something utterly specific to NNBSP - I don't have the inputs and background to know whether your approach is feasible or perhaps the best possible.

As for PUNCTUATION SPACE - some of the spaces have acquired usage in math (as part of the math support added in Unicode 3.2). We need to be sure that any assumptions about these that may have been made in math typesetting are not invalidated.

Not sure offhand whether UTR#25 captures all of that, but if you ever feel like proposing a property change you MUST research that first (with the current maintainers of that UTR or other experts).

This is the way Unicode is different from CLDR.

A./
