Re: Standaridized variation sequences for the Desert alphabet? from Michael Everson on 2017-03-28 (Unicode Mail List Archive)

From: Michael Everson <everson_at_evertype.com>
Date: Tue, 28 Mar 2017 14:56:28 +0100

On 28 Mar 2017, at 11:39, Martin J. Dürst <duerst_at_it.aoyama.ac.jp> wrote:

>> And what would the value of this be? Why should I (who have been doing this for two decades) not be able to use the word “character” when I believe it correct? Sometimes you people who have been here for a long time behave as though we had no precedent, as though every time a character were proposed for encoding it’s as though nothing had ever been encoded before.
>
> I didn't say that you have to change words. I just said that I could agree to a slightly differently worded phrase.

An æ ligature is a ligature of a and of e. It is not some sort of pretzel. What Deseret has is this:

10426 DESERET CAPITAL LETTER SHORT AH WITH STROKE
        * officially named “oi” in the code chart
        * used for oi in earlier texts
10427 DESERET CAPITAL LETTER LONG OO WITH STROKE
        * officially named “ew” in the code chart
        * used for ew in earlier texts
1xxxx DESERET CAPITAL LETTER LONG AH WITH STROKE
        * used for oi in later texts
1xxxx DESERET CAPITAL LETTER SHORT OO WITH STROKE
        * used for ew in later texts

Don’t go trying to tell me that LONG OO WITH STROKE and SHORT OO WITH STROKE are glyph variants of the same character.

Don’t go trying to tell me that LONG AH WITH STROKE and SHORT AH WITH STROKE are glyph variants of the same character.

To do so is to show no understanding of the history of writing systems at all. You’re smarter than that. So are Asmus and Mark and Erkki and any of the other sceptics who have chimed in here.

> And as for precedent, the fact that we have encoded a lot of characters in Unicode doesn't mean that we can encode more characters without checking each and every single case very carefully, as we are doing in this discussion.

The UTC encodes a great many characters without checking them at all, or even offering documentation on them to SC2. Don’t think we haven’t observed this.

>> The sharp s analogy wasn’t useful because whether ſs or ſz users can’t tell either and don’t care.
>
> Sorry, but that was exactly the point of this analogy. As to "can't tell", it's easy to ask somebody to look at an actual ß letter and say whether the right part looks more like an s or like a z.

By “can’t tell” I mean “recognize as essentially the same letterform”. The streetsigns in some German cities use a very ſʒ if you look at it and know anything about typography. Most people probably don’t notice. They see ß and that’s precisely because ſs and ſʒ look very much alike.

> On the other hand, users of Deseret may or may not ignore the difference between the 1855 and 1859 shapes when they read.

The people who wrote the manuscripts are dead. Most readers and writers of Deseret today use the shapes that are in their fonts, which are those in the Unicode charts, and most texts published today don’t use the EW and OI ligatures at all, because that’s John Jenkins’ editorial practice. The need to distinguish these letters (which are distinguished because of their history as letterforms, not because of the diphthong) is no different from the reason we encoded these Ꜩ Ꜫ Ꜭ Ꜯ Ꜳ Ꜵ Ꜷ Ꜹ Ꜻ Ꜽ Ꜿ Ꝃ Ꝁ Ꝅ Ꝇ Ꝉ Ꝋ Ꝍ Ꝏ Ꝑ Ꝓ Ꝕ Ꝗ Ꝙ Ꝛ Ꝝ Ꝟ Ꝡ Ꝣ Ꝥ Ꝧ Ꝩ Ꝫ Ꝭ Ꝯ Ꝺ Ꝼ Ᵹ Ꝿ Ꞁ Ꞃ Ꞅ Ꞇ. Scholars required those. Manuscripts may contain them side by side. Or their usage may be separated by hundreds of kilometres or hundreds of years. There is no difference. There were pages of discussion as to WHY scholars needed the medievalist characters. The counter argument was “Why not normalize?” We had similar pages of discussion as to WHY Uralicists needed the great many characters we encoded for them.

Why is it that you people can encode BROCCOLI on the basis of nothing but “people might like it” but we cannot use sound existing precedent to encode characters which (while similar in use to other characters) are an index of orthographic change in a historical script and orthography? There are plenty of “glyph variations” in early Deseret texts vis à vis which I’d ignore.

This isn’t one of them.

> Of course they will easily see different shapes, but what's important isn't the shapes, it's what they associate it with. If for them, it's just two shapes for one and the same 40th letter of the Deseret alphabet, then that is a strong suggestion for not encoding separately, even if the shapes look really different.

Martin, there is no answer to this unless you can read the minds of people who are dead a century or more. Therefore it is not a useful criterion, and the other criteria (letter origin, spelling choice) are the indices which must guide our understanding. The result of those criteria is that there are four characters here, not two.

> No Fraktur fonts, for instance, offer a shape for U+00DF that looks like an ſs. And what Antiqua fonts do, well, you get this:
>>
>> https://en.wikipedia.org/wiki/%C3%9F#/media/File:Sz_modern.svg
>
> Yes. And we are just starting to collect evidence for Deseret fonts.

Well you aren’t going to get full repertoires from the 19th-century lead type because they don’t exist. We have what we have of them, and we have the manuscripts. As to modern digital typefaces, there are NONE which support the 1859 letters. And I’ve seen most of them.

>> And there’s nothing unrecognizable about the ſɜ (< ſꝫ (= ſz)) ligature there.
>
> Well, not to somebody used to it. But non-German users quite often use a Greek β where they should use a ß, so it's no surprise people don't distinguish the ſs and ſz derived glyphs.

I’ve received German texts which used Greek β. But that’s not the point. People don’t distinguish the ſs and ſʒ glyphs because they look pretty much the same AND there’s no reason to distinguish them. A world of difference between that and the Deseret LETTERs WITH STROKE.

>> The situation in Deseret is different.
>
> The graphic difference is definitely bigger,

For pity’s sake, Martin. 𐐉 𐐃 look NOTHING ALIKE. And 𐐅 and 𐐋 look NOTHING ALIKE. This isn’t anything like ſs and ſʒ and ſz and ß.
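
For readers without a Deseret font, the identity of the four base letters named above can be checked from their Unicode character names. The following is a minimal sketch in Python, assuming only the standard unicodedata module; the helper name describe is hypothetical and is not something used elsewhere in this thread.

```python
import unicodedata

# The four base letters contrasted above, written as escapes so the
# example does not depend on a Deseret font being installed.
letters = "\U00010409\U00010403\U00010405\U0001040B"

def describe(text):
    """Return (code point, character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for code_point, name in describe(letters):
    print(code_point, name)
```

The first pair is SHORT AH and LONG AH, the second LONG OO and SHORT OO: four separately encoded letters.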

> so to an outsider, it's definitely quite impossible to identify the pairs of shapes. But that does in no way mean that these have to be seen as different characters (rather than just different glyphs) by insiders (actual users).

They had a script reform and they cut new type. They did this on purpose. Note that in their ligatures they shifted from SHORT AH and LONG OO to LONG AH and SHORT OO.

> To use another analogy, many people these days (me included) would have difficulties identifying Fraktur letters, in particular if they show up just as individual letters.

I do not believe you. If this were true, menus in restaurants and public signage on shops wouldn’t have Fraktur at all. It’s true that sometimes the orthography on such things is bad, as where they don’t use ligatures correctly or the ſ at all.

I’ll stipulate that few Germans can read Sütterlin or similar hands. :-)

> Similar for many fantasy fonts, and for people not very familiar with the Latin script.

What’s a fantasy font? And what does this have to do with supporting the encoding in plain text of historical documents in the Deseret script?

>> The lower two letterforms are in no way “glyph variants” of the upper two letterforms. Apart from the stroke of the SHORT I 𐐆 they share nothing in common — because they come from different sources and are therefore different characters.
>
> The range of what can be a glyph variant is quite wide across scripts and font styles. Just that the shapes differ widely, or that the origin is different, doesn't make this conclusive.

LONG OO WITH STROKE is not a glyph variant of SHORT OO WITH STROKE. LONG AH WITH STROKE is not a glyph variant of SHORT AH WITH STROKE.

>> I don’t think that ANY user of Deseret is all that “average”. Certainly some users of Deseret are experts interested in the script origin, dating, variation, and so on — just as we have medievalists who do the same kind of work. I’m about to publish a volume full of characters from Latin Extended-D. My work would have been impossible had we not encoded those characters.
>
> No, your work wouldn't be impossible. It might be quite a bit more difficult, but not impossible.

No. Wrong. Wrong, wrong, wrong. No, Martin. We encoded the Latin characters on the basis of good arguments. You do NOT get to invalidate that, or to pretend that the encoding of those characters was a mistake, or anything like it. Many scholars — including myself — use these characters, and that is what the Universal Character Set is for.

Also, apparently, it is for pictures of BROCCOLI.

> I have written papers about Han ideographs and Japanese text processing where I had to create my own fonts (8-bit, with mostly random assignments of characters because these were one-off jobs), or fake things with inline bitmap images (trying to get information on the final printer resolution and how many black pixels wide a stem or crossbar would have to be to avoid dropouts, and not being very successful).

All of us make use of nonce glyphs for examples. That’s not the same as making an edition of a medieval Cornish text, or of a Mormon diary. We do NOT want to have to use font trickery.

> I have heard the argument that some character variant is needed because of research, history,... quite a few times. If a character has indeed been historically used in a contrasting way,

Contrast may be geographical or temporal.

> this is definitely a good argument for encoding. But if a character just looked somewhat different a few (hundreds of) years ago,

Also, LATIN LETTER D WITH STROKE is a different letter from LATIN LETTER T WITH STROKE. Why? Because the underlying letters are different. And it’s no different for Deseret.

Your suggestion that LONG AH WITH STROKE and SHORT AH WITH STROKE are the same character is unsupportable.

> that doesn't make such a good argument. Otherwise, somebody may want to propose new codepoints for Bodoni and Helvetica,…

This suggestion is nonsense.

On 28 Mar 2017, at 11:59, Mark Davis ☕️ <mark_at_macchiato.com> wrote:

> I agree with Martin.
>
> Moreover, his last paragraphs are getting at the crux of the matter. Unicode is not a registry of glyphs for letters, nor should it try to be.

DESERET LETTER LONG AH WITH STROKE is not a glyph variant of DESERET LETTER SHORT AH WITH STROKE.

> Simply because someone used a particular shape at some time to mean a letter doesn't mean that Unicode should encode a letter for that shape.

Coming to a forum like this out of a concern for the corpus of Deseret literature is not some sort of attempt to encode things for encoding’s sake.

> We do not need to capture all of the shapes in https://upload.wikimedia.org/wikipedia/commons/f/fc/Gebrochene_Schriften.png simply because somebody is going to "publish a volume full of" those shapes.

That analogy has nothing to do with the discussion about the Deseret letters.

On 28 Mar 2017, at 12:33, Martin J. Dürst <duerst_at_it.aoyama.ac.jp> wrote:

> Do you think that the 1855/1859 distinction is needed in file names? In text messages? It may help in some kinds of databases, but it may also be possible to just tag each piece of text in the database with "1855" or "1859" if that distinction is important (e.g. for historical documents). As far as I understand, we are still looking for actual texts that use both shapes of the same ligature concurrently.

I think that this is the sort of distinction that should be made in plain text, yes. The 1859 letters are not “glyph variants” of the 1855 letters by any criterion in the history of writing systems that I recognize.

On 2017/03/28 01:20, Michael Everson wrote:

>> Ken transcribes into modern type a letter by Shelton dated 1859, in which “boy” is written 𐐒<𐐃𐐆>, “few” as 𐐙<𐐆𐐋>, “truefully” [sic] as 𐐓𐐡<𐐆𐐋>𐐙𐐋𐐢𐐆, and “you” as 𐐏<𐐆𐐋>.
>
> These are all 1859 variants, yes?

Yes, it was one letter written by one person at one sitting and he used one orthography and he didn’t mix it with the other orthography.

> That would just show that these variants existed (which I think nobody in this discussion has doubted), but not that there was contrasting use. And is that letter hand-written or printed?

They had a script reform. At first Mormons used the letter SHORT AH WITH STROKE [ɒɪ] for /ɔɪ/ and then later they used LONG AH WITH STROKE [ɔːɪ] for /ɔɪ/. And at first Mormons used the letter LONG OO WITH STROKE [ɪuː] for /juː/ and then later they used SHORT OO WITH STROKE [ɪʊ] for /juː/. And some Mormons didn’t use either, they just wrote the diphthongs with digraphs of other letters.

On 28 Mar 2017, at 13:10, Martin J. Dürst <duerst_at_it.aoyama.ac.jp> wrote:

>> And the same goes for the /juː/ ligatures. The word tube /tjuːb/ can be written TYŪB 𐐓𐐏𐐅𐐒 or 𐐓𐐧𐐒 or 𐐓<𐐆𐐋>𐐒. But the unligated sequences would be pronounced differently: 𐐓𐐏𐐅𐐒 /tjuːb/ and 𐐓𐐆𐐅𐐒 /tɪuːb/ and 𐐓𐐆𐐋𐐒 /tɪʊb/.
>
> Ah, I see. So we seem to have five different ways (counting the two ligature variants) of writing the same word,

That’s called spelling.

> with three different pronunciations.

No, that’s wrong. I give those transcriptions to show the usual meanings of the Deseret letters. So if you were going to write “tube” /tjuːb/ you would write 𐐓𐐏𐐅𐐒 or 𐐓𐐧𐐒 or 𐐓<𐐆𐐋>𐐒. In the second sentence I show that while the ligated letters 𐐧 and <𐐆𐐋> can be used for /juː/ the unligated sequences 𐐆𐐅 and 𐐆𐐋 would in principle be pronounced /ɪuː/ and /ɪʊ/ respectively.
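
The three spellings can be spelled out letter by letter. The following is a minimal sketch in Python, assuming only the standard unicodedata module. The helper names short_name and units are hypothetical, and the angle brackets follow the notation used in this message for a ligature that has no code point of its own.

```python
import unicodedata

def short_name(ch):
    """Character name without the common Deseret prefix."""
    return unicodedata.name(ch).replace("DESERET CAPITAL LETTER ", "")

def units(spelling):
    """Split a spelling into written units; <...> marks an unencoded ligature."""
    result, i = [], 0
    while i < len(spelling):
        if spelling[i] == "<":
            end = spelling.index(">", i)
            parts = [short_name(c) for c in spelling[i + 1:end]]
            result.append("<" + " + ".join(parts) + ">")
            i = end + 1
        else:
            result.append(short_name(spelling[i]))
            i += 1
    return result

# The three spellings of "tube" given above, written as escapes.
spellings = [
    "\U00010413\U0001040F\U00010405\U00010412",   # digraph spelling
    "\U00010413\U00010427\U00010412",             # encoded EW ligature
    "\U00010413<\U00010406\U0001040B>\U00010412", # later, unencoded ligature
]

for spelling in spellings:
    print(" | ".join(units(spelling)))
```

Each line of output lists the units of one spelling, so the digraph, the encoded EW and the bracketed SHORT I plus SHORT OO ligature can be compared directly.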

Obviously the pronunciation of the word “tube” would not have changed for speakers of English in Mormon territories in the middle of the 19th century. (Of course many dialects of English in North America now have /tuːb/ rather than /tjuːb/ but that is not relevant here.)

> The important question is whether the two ligatures do imply any difference in pronunciation (as opposed to time of writing or author/printer preference), i.e. whether the ligated sequences 𐐓𐐧𐐒 or 𐐓<𐐆𐐋>𐐒 are pronounced differently (not by a phonologist but by an average user).

No, it’s spelling.

Michael Everson
Received on Tue Mar 28 2017 - 08:57:14 CDT