Re: Purpose of plain text from Christoph Päper on 2011-11-15 (Unicode Mail List Archive)

From: Christoph Päper <christoph.paeper_at_crissov.de>
Date: Tue, 15 Nov 2011 14:11:10 +0100

Doug Ewell:

> How can I search a group of documents, one written in Devanagari and another in Sinhala and another in Tamil and another in Oriya, for a given string if they all use the same encoding, and the only way to tell which is which is to see them rendered in a particular font?

That question would make no sense if you didn’t already consider them different scripts.
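As a rough illustration of why separate encoding makes such a search possible, here is a minimal Python sketch. It assumes that the first word of a character's Unicode name is a good enough stand-in for its script, which is only a heuristic (Python's standard library does not expose the Script property). The helper names scripts_in and find_in_script are made up for this example.

```python
import unicodedata

def scripts_in(text):
    """Guess the scripts in text from the first word of each character name.

    Heuristic only: this relies on naming conventions, not on the real
    Unicode Script property.
    """
    found = set()
    for ch in text:
        # Only look at letters (L*) and combining marks (M*).
        if unicodedata.category(ch)[0] in "LM":
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found

def find_in_script(documents, script):
    """Return the documents that contain at least one character of script."""
    return [doc for doc in documents if script in scripts_in(doc)]

# The syllable "ka" (or its closest counterpart) in four scripts.
docs = [
    "\u0915",  # Devanagari
    "\u0D9A",  # Sinhala
    "\u0B95",  # Tamil
    "\u0B15",  # Oriya
]

for doc in docs:
    print(f"U+{ord(doc):04X}", scripts_in(doc))
```

With a unified encoding all four strings would be the same code point, and only font information outside the text could tell them apart.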

It is a good indication in favor of unification, for example, when you use your local script for all loan words from other languages that use related scripts, but there are counter-examples:
- Japanese Katakana and Hiragana have developed from the same source for the same language, so that discrimination does not apply.
- In German Fraktur texts you will see modern Romance borrowings or English xenisms set in Antiqua (but old Greek and Latin loan words in Fraktur), which would be in favor of disunification by the above criterion.
I have no (sufficient) idea how it works in India and its South Asian neighbors. Most books I have read on writing, scripts or writing systems look at each system – identified by varying definitions – in isolation and connect them only by descent, not by discrimination.

> Latin (Antiqua) and Fraktur and Gaelic letters are, intrinsically, the same letter. That is not true for Devanagari and Sinhala and Tamil and Oriya letters.

If I understand Naena Guru correctly, they want to unify all the Brahmic/Indic scripts (similar to ISCII) and, furthermore, unify them (in a transliterating manner) with the Roman script. The second part is silly, unless there is a romanization movement I’m unaware of.

Whether to draw the line between two related scripts or between two hands (fonts, typefaces, …) of the same script is sometimes an arbitrary, yet informed, decision.

In “Euroscript” – the combination of Cyrillic, Greek and Roman scripts – some uppercase letters look the same most of the time, but the corresponding lowercase letters differ, often quite a lot. That alone is reason enough not to unify them. Each of the scripts, however, shows similar glyphic variation across all of its letters, and two variants need to be distinguished in the encoding only if both can be used in the same text for different purposes. This only applies below the lexical level, though: an italic ‘a’ inside an otherwise upright word, or vice versa, is still the same letter, but an isolated italic ‘a’ may need to be distinguished from an upright ‘a’. Since this most often happens in formulae, it comes down to the question of whether you want to be able to encode more notations (incl. IPA phonetics) than written language, i.e. “true writing”.
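Both points can be checked against what Unicode actually does, using Python's standard unicodedata module. This is only a sketch: the choice of B, Beta and Ve as the lookalike capitals is mine, and the variable names are made up for the example.

```python
import unicodedata

# Capitals that usually look alike: Latin B, Greek Beta, Cyrillic Ve.
LOOKALIKE_CAPITALS = ["\u0042", "\u0392", "\u0412"]

for ch in LOOKALIKE_CAPITALS:
    low = ch.lower()
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}"
          f" -> U+{ord(low):04X} {unicodedata.name(low)}")

# An isolated italic 'a' as used in formulae has its own code point ...
ITALIC_A = "\U0001D44E"  # MATHEMATICAL ITALIC SMALL A
print(unicodedata.name(ITALIC_A), ITALIC_A == "a")

# ... but compatibility normalization folds it back into the plain letter.
print(unicodedata.normalize("NFKC", ITALIC_A) == "a")
```

The three capitals map to three clearly different lowercase letters, which is the argument against unifying them. The italic 'a' exists as a separate character only for notational use; an italic 'a' inside a word is still expected to be the ordinary letter plus styling.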

Alphabetic scripts, i.e. those that use vocalic and consonantal letters at the same level and no diacritics, are by definition the easiest to encode digitally. For the rest, however, there is more than one way to skin a cat. It is quite possible that the Brahmic/Indic family wasn’t encoded in the best way, for two reasons: related scripts could have been unified, and you can approach most of them from at least two directions (segmental or syllabic). Of course, it’s probably too late to change now.

It’s tough to find a definition that fairly and usefully distinguishes symbol, sign, mark, letter, character, stroke, diacritic, glyph, graph, grapheme, frame … If you have one and can get everybody to agree on it, you then still have to decide which of the entities to encode, which software layer should render them and how to type them on a keyboard. And then you have to stick to that decision.

Sadly I haven’t seen a good definition, not everyone agrees, and deviation from the decision is common.

Unicode, for instance, usually tries to encode what it thinks of as characters, but under certain conditions it does accept letters, e.g. precomposed characters (including Hangul syllabograms and CJKV sinograms), and symbols, e.g. emoticons.
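
The precomposed case is easy to observe with Python's standard unicodedata module. The following is a small sketch; the sample characters are my own choice and the helper name code_points is made up for the example.

```python
import unicodedata

def code_points(text):
    """Format a string as a list of U+XXXX code points."""
    return [f"U+{ord(ch):04X}" for ch in text]

E_ACUTE = "\u00E9"  # precomposed LATIN SMALL LETTER E WITH ACUTE
HAN = "\uD55C"      # precomposed Hangul syllable (h + a + n)

for sample in (E_ACUTE, HAN):
    decomposed = unicodedata.normalize("NFD", sample)
    recomposed = unicodedata.normalize("NFC", decomposed)
    print(code_points(sample), "->", code_points(decomposed),
          "->", code_points(recomposed))
```

In both cases the single encoded unit decomposes canonically into smaller encoded units (a base letter plus a combining accent, or three conjoining jamo) and recomposes to the original, so the same text has two equivalent encodings at different levels of granularity.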
Received on Tue Nov 15 2011 - 07:14:37 CST
