Re: polytonic Greek: diacritics above long vowels ᾱ, ῑ, ῡ from Philippe Verdy on 2013-08-03 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sat, 3 Aug 2013 08:43:43 +0200

Many precomposed characters were initially encoded only for round-trip
compatibility with pre-existing encoding standards. But the way forward
was to encode combining characters separately (at least those that are
normally not attached to, or overstriking, the base letter).

Very soon, the normalization forms were created to unify the two possible
types of encoding and to introduce the concept of canonical equivalence.
The most common practice, though, was to use the precomposed forms, so NFC
became a de facto standard. But the decomposed forms are not deprecated at
all: the standard has made a lot of effort to ensure that these forms are
fully equivalent, and implementers (notably of plain-text search and of
rendering) have been urged to treat all canonically equivalent forms the
same way. There is only one exception: collation, where the difference is
invisible at every collation strength and is considered only in order to
sort canonically equivalent forms together in a stable order. Stability is
reached by adding an artificial level that sorts in binary order. This
binary order is still often not standardized: implementers may sort either
by the numeric values of code points or by the numeric values of code units
in some UTF encoding. It is in fact purely arbitrary and only meant to
ensure sort stability.
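
As a minimal sketch of the two points above (in Python, using only the
standard unicodedata module; the helper names are made up for this
illustration), the following compares a precomposed and a decomposed
spelling, and then shows that two plausible "binary" tie-break orders can
disagree:

```python
import unicodedata

def canonically_equal(a, b):
    """True if a and b are canonically equivalent (compared after NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u1FD1"        # GREEK SMALL LETTER IOTA WITH MACRON
decomposed = "\u03B9\u0304"   # GREEK SMALL LETTER IOTA + COMBINING MACRON

print(precomposed == decomposed)                   # False: different code points
print(canonically_equal(precomposed, decomposed))  # True

# Two hypothetical tie-break keys for the final "binary" level.
def by_code_point(s):
    return [ord(c) for c in s]

def by_utf16_code_unit(s):
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# U+10000 needs a surrogate pair (D800 DC00) in UTF-16, which sorts
# below U+FF5E by code unit but above it by code point.
samples = ["\U00010000", "\uFF5E"]
print(sorted(samples, key=by_code_point) ==
      sorted(samples, key=by_utf16_code_unit))     # False: the orders differ
```

Either key gives a stable order; the only point is that an implementation
has to pick one and that the choice is not dictated by the text itself.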

The complexity of the implementations needed to ensure that canonically
equivalent texts are treated the same led to the definition of "conforming
processes". And to make sure that processes would remain conforming, and
that encoded texts (in any normalization form, and independently of the UTF
used for parsing or storing/sending them) would remain usable after being
encoded once, even as the Unicode and ISO/IEC 10646 standards evolve, it
soon became necessary to add the concept of "encoding stability", which was
formalized by a strong policy.

One consequence of this strong policy is that we can no longer encode new
precomposed characters for grapheme clusters that are already encoded in
an existing standard form. The macron was encoded separately long ago, as
were all the basic Greek letters (even before many polytonic characters
were introduced in the UCS). This now makes it impossible to introduce new
characters for Greek letters with macron, unless we accept making them not
canonically equivalent to the existing sequences. That would create serious
issues, because conforming processes will not be rewritten or modified to
recognize additional "visual equivalences" (which are different from
"canonical equivalences").
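
To make this concrete, here is a small sketch (Python, standard
unicodedata module only; the helper name is an assumption of this example)
applied to the sequence quoted further down in this message, U+1FD1 U+0313
U+0301. NFC has no single precomposed character to map it to, so the
cluster stays a sequence in both normalization forms:

```python
import unicodedata

# Iota with macron, then combining smooth breathing, then combining acute.
cluster = "\u1FD1\u0313\u0301"

def code_points(s):
    """Format a string as a list of U+XXXX labels."""
    return ["U+%04X" % ord(c) for c in s]

nfc = unicodedata.normalize("NFC", cluster)
nfd = unicodedata.normalize("NFD", cluster)

print(code_points(nfc))  # ['U+1FD1', 'U+0313', 'U+0301']
print(code_points(nfd))  # ['U+03B9', 'U+0304', 'U+0313', 'U+0301']
```

Both results are canonically equivalent, and a conforming process is
expected to treat them the same way.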

So you have to live with it, as long as the UCS remains a universal
standard supported both by international standards bodies and by the
industry (you can expect it to remain a standard for at least a century;
even after that, it will remain widely used and interchange will still be
needed, because tremendous amounts of data will have been archived and
won't be re-encoded).

But this is not a problem. The "de facto" NFC form (used for compatibility
with old processes) is needed less and less: lots of processes now
recognize the canonical equivalences and are able to process grapheme
clusters encoded with several characters, including combining characters.
Storage space is also no longer a major issue (the problem lies less in the
encoding of a few clusters than in the growing amount of encoded text). If
storage size really matters, we have long had binary data compressors such
as ZIP/deflate and gzip. They are extremely fast today, are used only
during the input or output of the complete encoded text, and are invisible
to the more complex and more specific text parsers that applications will
need.
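
A rough sketch of that idea, assuming Python's standard zlib module (which
implements DEFLATE, the algorithm behind ZIP and gzip) and an artificial
sample made of one repeated word, so the ratios are far better than real
text would give:

```python
import unicodedata
import zlib

# Artificial sample: one polytonic word repeated 200 times.
sample = "\u1FD1\u0313\u0301\u03C6\u03B9\u03BF\u03C2 " * 200

def sizes(text, form):
    """Return (raw UTF-8 size, deflate-compressed size) for a normalization form."""
    raw = unicodedata.normalize(form, text).encode("utf-8")
    packed = zlib.compress(raw)
    # Compression is lossless: the exact encoded text comes back.
    assert zlib.decompress(packed) == raw
    return len(raw), len(packed)

for form in ("NFC", "NFD"):
    raw_size, packed_size = sizes(sample, form)
    print(form, raw_size, packed_size)
```

The compressor only sees bytes at the input/output boundary; everything
between decompression and compression works on the plain encoded text.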

It may still be a minor issue for "texts" that must be extremely compact,
such as those used as "identifiers".

* But identifiers are generally invisible to users, and it is accepted
that the exact orthography of identifiers is simplified (identifiers are
frequently abbreviated as well). In fact identifiers avoid many standard
practices applicable to normal text (for example by not using presentation
forms, limiting the use of punctuation, avoiding changes in capitalization,
or adding requirements about it that are completely foreign to the normal
orthographic/grammatical rules of a human language). If your application
using identifiers cannot use combining characters, you will drop them (and
keep identifiers distinct by using some other characters, such as added
digits, or some other convention).

* In other cases, strong length limitations will occur in data input forms
that use too small a size for storage (notably in databases). But the
needed increase in maximum storage length is also something to take into
account if your application needs to handle international text. There are
some good practices to follow. Generally this means creating a UI where
data fields can be read without scrolling horizontally or breaking lines,
choosing appropriate font sizes, and then defining a database length that
allows input and full storage of any text that can fit in the displayed
fields (for example VARCHAR(80) rather than VARCHAR(12) if you can input
12 Latin characters in your UI). But some newer languages do not need to
restrict the length of their strings, and applications written in them
will not be affected by these limits and will not restrict storage to
small sizes. (Database engines that support text with an unspecified
maximum length still have a limit, but it is generally long enough that
you will be able to fit any text from a reasonable input form; even if
this limit is 255 code units, it is still long enough to store a single
data input field of any form for entering text in any existing language.)
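
The two cases above can be sketched as follows (Python, standard
unicodedata module; strip_marks and lengths are hypothetical helper names,
not part of any library). The first function drops combining marks for use
in an identifier; the second shows how the same word measures in the units
a storage limit might be expressed in:

```python
import unicodedata

def strip_marks(text):
    """Drop combining marks (general category M*) after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed
                   if not unicodedata.category(c).startswith("M"))

def lengths(text):
    """Length of text in code points, UTF-8 bytes and UTF-16 code units."""
    return {
        "code points": len(text),
        "utf-8 bytes": len(text.encode("utf-8")),
        "utf-16 units": len(text.encode("utf-16-le")) // 2,
    }

word = "\u1FD1\u0313\u0301\u03C6\u03B9\u03BF\u03C2"

print(strip_marks(word))  # plain letters only: U+03B9 U+03C6 U+03B9 U+03BF U+03C2
for form in ("NFC", "NFD"):
    print(form, lengths(unicodedata.normalize(form, word)))
# NFC: 7 code points, 15 UTF-8 bytes, 7 UTF-16 code units
# NFD: 8 code points, 16 UTF-8 bytes, 8 UTF-16 code units
```

Whether a column limit counts characters, bytes or code units depends on
the database engine, so the field length should be sized with the largest
of these in mind.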

So in summary, no decision was actually made to exclude the encoding of
Greek letters with macrons, but there was no need to encode them as new
precomposed characters: they were not yet encoded in any prior standard,
the UCS already encodes them with combining characters, and TUS already
standardizes the best practices for handling them in every standardized
normalization form and standardized encoding (or legacy encoding supported
by round-trip compatibility and listed in an informative appendix of the
Unicode Standard).

2013/8/3 Stephan Stiller <stephan.stiller_at_gmail.com>

>
> Characters restricted to dictionaries are generally not well
>> supported.
>>
> And modern textbooks in a modern world :-)
>
>
> The practice in Scott and
>> Liddell is to reserve ᾱ, ῑ and ῡ for a note after the dictionary entry.
>>
> Liddell & Scott is old, just like Lewis & Short. We've moved on since
> then, and given the stuff that's been put into the Greek blocks (things
> that for sure aren't even in most dictionaries) I was just surprised.
> Whatever the rationale for original precomposition and later inclusion of
> more characters was, I suppose common practice instead of inclusiveness was
> a criterion.
>
> With that written, thanks for the info.
>
> ῑ̓́φιος [...] ῑ̓́ (which should be thought of as ῑ
>>>
>>> with two combining diacritics: U+1FD1 U+0313 U+0301)
>>>
>> You overlooked the smooth breathing for the first iota.
>>
> It's there. Check again.
>
> Stephan
>
>
>
Received on Sat Aug 03 2013 - 01:47:05 CDT
