From: Carl W. Brown (email@example.com)
Date: Sun Nov 10 2002 - 11:15:52 EST
There already is a Unicode solution for the problem. Check UAX #21 (Case Mappings). If search engines use case-insensitive comparisons, then it should be no problem.
There are a lot of exceptions to the rule, so you need separate characters for the forms, but you also need an algorithm that works reasonably well for most cases.
"Character (final sigma) is preceded by a sequence consisting of a cased letter and a case-ignorable sequence, and character is not followed by a sequence consisting of an ignorable sequence and then a cased letter."
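The quoted condition can be sketched in a few lines of Python. This is only a rough approximation under stated assumptions: "cased letter" is taken to mean general category Lu, Ll, or Lt, and "case-ignorable" is approximated by a handful of general categories plus a few punctuation marks, not the full property from the standard. The helper names (is_cased, is_case_ignorable, is_final_sigma, lower_greek) are made up for illustration.

```python
import unicodedata

SIGMA_CAPITAL = "\u03a3"  # Σ
SIGMA_MEDIAL = "\u03c3"   # σ
SIGMA_FINAL = "\u03c2"    # ς


def is_cased(ch):
    # Simplifying assumption: a cased letter is any Lu, Ll, or Lt character.
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt")


def is_case_ignorable(ch):
    # Simplifying assumption: marks, format characters, modifier letters and
    # symbols, plus a few mid-word punctuation marks (apostrophes, period,
    # colon, middle dot).
    if unicodedata.category(ch) in ("Mn", "Me", "Cf", "Lm", "Sk"):
        return True
    return ch in "'\u2019.:\u00b7"


def is_final_sigma(text, i):
    # Look backward over case-ignorable characters for a cased letter.
    j = i - 1
    while j >= 0 and is_case_ignorable(text[j]):
        j -= 1
    preceded = j >= 0 and is_cased(text[j])
    # Look forward over case-ignorable characters for a cased letter.
    k = i + 1
    while k < len(text) and is_case_ignorable(text[k]):
        k += 1
    followed = k < len(text) and is_cased(text[k])
    return preceded and not followed


def lower_greek(text):
    # Lowercase text, choosing ς or σ for each capital sigma by context.
    out = []
    for i, ch in enumerate(text):
        if ch == SIGMA_CAPITAL:
            out.append(SIGMA_FINAL if is_final_sigma(text, i) else SIGMA_MEDIAL)
        else:
            out.append(ch.lower())
    return "".join(out)
```

With this sketch, lower_greek("ΦΙΛΟΣΟΦΙΑ") keeps the medial form and lower_greek("ΟΔΟΣ ΤΙΣ") produces final forms at both word ends. It also shows the kind of exception mentioned above: an abbreviation written in capitals as "ΦΙΛΟΣ." comes out with a final sigma, because the rule cannot tell an abbreviation from a complete word.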
From: firstname.lastname@example.org [mailto:email@example.com]On Behalf Of Jim Allan
Sent: Saturday, November 09, 2002 2:49 PM
Subject: Lunate, Terminal, and Medial Sigma
Patrick Rourke posted:
So either there should only be one sigma, with the presentation being determined by position (unless the font defines both positions as lunate), or there should only be the medial and terminal and no lunate "symbol," with lunate being defined only by the font - but then most people entering Greek text would just use the medial form for all sigmas, regardless of the position. Maybe text entry could correct this . . .
This would only work in most cases in modern Greek, and less often in historical documents.
Yannis Haralambous, in “From Unicode to Typography, a Case Study: the Greek Script” at http://omega.enstb.org/yannis/pdf/boston99.pdf, writes:
The letter sigma has a final form, written ς. Although this is a contextual property, there is a Unicode character for this letter: U+03C2; this is perfectly justified, because in some cases there is a semantical difference between the medial and final form of σ: for example, “φιλοσ.,” is necessarily the abbreviation of some word (like φιλοσοφία) while “φιλος.” is a single non-abbreviated word, followed by a sentence period. In cases like this the form of the σ cannot be determined by a simple algorithm.
There is a typographical curiosum, involving the final sigma: the Grammar of Pontiac Language by K. Topkharas ([Top], reprinted in [Top₂]), published in 1928, in the Soviet Union, for the (Pontiac) Greek speaking minorities. This grammar completely abolishes accents, breathings, diphthongs, and uses only part of the alphabet. The ς is used for the sound ‘s’, and a double ςς for the English ‘sh’. Here is an excerpt of this book [Top, p. 49]:
Σιν γλοςανεμυν επεμνεν ας αρχεον τιν γλοςαν κε το ακλιτον το λεκςοπον α πυ μεταχιριςκυςανατο ι παλιεμυν, ονταν εθελναν να φανερονε πος καπιον ιδιοτιταν πυ εςς εναν προςοπον για πραμαν, λιφταςςκετιατο καπιον αλο λ.χ. δινατος κε αδινατος.
From "SIGMA" by Katerina Sarri at http://users.otenet.gr/~bm-celusy/sigma.html:
By c.400 B.C.E. sigma took its final shape Σ at all greek city-states. The final <ς> was a later calligraphic version, when ending some words, and gradually, when ending all words. In old manuscripts it may be marked also within composed words (as the final letter of the first word) as in: ειςβάλλω = εισβάλλω < εις+βάλλω ( I go in, attack). Also, the 'lunate sigma' (as looks the third letter of the latin alphabet) C was used instead of Σ,σ,ς (in the byzantine manuscripts, and today as a calligraphic variety, especially by the church).
One might indeed work with a smart-sigma text entry routine, like the smart-quotes routines, but one would also want to be able to turn it off or override it when necessary, as one can with smart-quotes routines. That should not depend on proprietary switches in a particular font, which are not accessible through every program, or on algorithms that may differ from font to font.
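A minimal sketch of such a routine, assuming the writer types σ everywhere and the entry layer swaps in ς at the end of a word. The function name smart_sigma and its enabled switch are hypothetical; they stand in for whatever override a real text entry program would offer.

```python
import re

# A medial sigma that follows a word character and is not followed by one.
_WORD_FINAL_SIGMA = re.compile(r"(?<=\w)\u03c3(?!\w)")


def smart_sigma(text, enabled=True):
    # With the routine switched off, return exactly what the writer typed.
    if not enabled:
        return text
    # Replace word-final σ with ς; a sigma standing alone is left untouched.
    return _WORD_FINAL_SIGMA.sub("\u03c2", text)
```

Here smart_sigma("τισ") gives "τις", but smart_sigma("φιλοσ.") also gives "φιλος.", which is wrong for the abbreviation. Calling it with enabled=False returns the typed text unchanged, which is the kind of override argued for above.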
An intelligent font in which the above quotations could not be properly produced, because it has its own ideas about where variants ought to appear or does not have them at all, is less useful than a stupid font which puts out what the writer produced. Unicode with three versions of lower-case sigma is more useful than Unicode with a single version. Encoding only one lower-case sigma would not reduce the complexity, only push it up to differing and incompatible higher protocols.
When the character variants have distinct semantics or a distribution that cannot be predicted algorithmically and is not random, encoding these variants at the Unicode plain text level is simple and robust and does not prevent a higher protocol from identifying the characters for particular purposes.
Patrick Rourke also posted:
I just can't wait for all the search failures resulting from searching for τις in a text
Given the increased number of characters and variants allowed by Unicode, complexity of intelligent searching also increases.
Search engines should allow variant insensitivity and diacritic insensitivity as they now usually allow case insensitivity. Case insensitivity is usually the default setting, and so should be variant insensitivity and perhaps diacritic insensitivity.
This should be better supported than it is.
Even Google distinguishes, I think foolishly, between caesar and cæsar and between fluss and fluß, to give two examples.
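As a rough illustration of what variant and diacritic insensitivity could look like, here is a sketch of a search key in Python. The function name search_key and the contents of the VARIANTS table are assumptions chosen for this example, not any search engine's actual behaviour; a real engine would need a far larger table.

```python
import unicodedata

# Hypothetical variant table: sigma forms and two Latin ligatures.
VARIANTS = str.maketrans({
    "\u03c2": "\u03c3",  # final sigma -> sigma
    "\u03f2": "\u03c3",  # lunate sigma symbol -> sigma
    "\u03f9": "\u03c3",  # capital lunate sigma symbol -> sigma
    "\u00e6": "ae",      # æ -> ae
    "\u0153": "oe",      # œ -> oe
})


def search_key(text, ignore_diacritics=True):
    # Case folding first (this also turns ß into ss), then variant folding.
    folded = text.casefold().translate(VARIANTS)
    if ignore_diacritics:
        # Decompose and drop combining marks (general category Mn).
        decomposed = unicodedata.normalize("NFD", folded)
        folded = "".join(
            c for c in decomposed if unicodedata.category(c) != "Mn"
        )
    return unicodedata.normalize("NFC", folded)
```

Comparing search_key(query) with search_key(word) makes τις, ΤΙΣ, and τιϲ match one another, and likewise cæsar and caesar, or fluß and fluss. Passing ignore_diacritics=False keeps accents significant while still folding case and variants.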
But even given that a search engine recognizes such variants, one still has to deal with spelling differences, e.g. ecumenical, oecumenical, œcumenical, eucumenical.
From the specifications for the Pandora search engine at http://etext.lib.virginia.edu/helpsheets/pandora.html:
Note that Pandora treats medial, final, and lunate sigma as the same letter.
As Unicode becomes more widely used, search engines will adapt.