Greek Language and Script
Q: Why are there two blocks of Greek
characters in the Unicode Standard?
A: The layout of the Greek script in the Unicode Standard
is an artifact of the history of Unicode and of ISO/IEC 10646. The
Unicode Standard started out with just the Greek block (U+0370..U+03FF),
with Greek characters laid out in compatibility with the modern Greek
monotonic standard, ISO/IEC 8859-7, and with additions for some Coptic,
ancient Greek, and Greek symbol letters. When the Unicode Standard had
the repertoire from drafts of ISO/IEC 10646 merged in, as part of the
standards compromise which resulted in the synchronization of the
Unicode Standard and 10646, the Unicode Standard acquired a
collection of pre-composed Greek characters which were intended for polytonic Greek
usage. Those had to be placed somewhere, and a
"compatibility" block was created at U+1F00..U+1FFF to accommodate them.
Q: Doesn't the existence of two blocks of
Greek characters create problems for searching in Greek?
A: No. If you examine the code charts for the
U+1F00..U+1FFF block of "extended" Greek carefully, you will note that
all the polytonic Greek pre-composed characters have canonical mappings.
This means that they are canonically equivalent to sequences consisting
of the basic letters plus combining
breathing and accent marks. Any properly constructed Unicode search
operation should treat canonical equivalents the same, so it should not
matter whether one specifies a target match in terms of the pre-composed
characters or in terms of the sequences of basic letters and combining
marks. This situation for Greek is no different from the requirement for
the Latin script that a search for a pre-composed Latin letter and the
same letter with a combining accent mark produce the same results.
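As a rough illustration of treating canonical equivalents the same, the sketch below normalizes both the text and the search target to NFD before comparing. It uses only Python's standard `unicodedata` module; the helper names `canonical_key` and `contains` are hypothetical, chosen for this example, and a production search would normally also respect grapheme boundaries and collation rules.

```python
import unicodedata


def canonical_key(text: str) -> str:
    """Return the NFD form, so canonically equivalent strings compare equal."""
    return unicodedata.normalize("NFD", text)


def contains(haystack: str, needle: str) -> bool:
    """Naive substring search that ignores precomposed/decomposed differences."""
    return canonical_key(needle) in canonical_key(haystack)


# The same word spelled two ways:
# U+1F04 is a precomposed alpha with psili and oxia from the extended block.
precomposed = "\u1f04\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"
# Basic alpha followed by U+0313 (psili) and U+0301 (oxia) combining marks.
decomposed = "\u03b1\u0313\u0301\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"

print(precomposed == decomposed)          # code point sequences differ
print(contains(decomposed, precomposed))  # but the search still matches
```

Note that a plain substring match on normalized text can also match a bare base letter inside an accented one; whether that is desirable depends on the application.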
Q: Which block of Greek characters should I
use?
A: It depends on what you are
doing. But generally, the basic Greek block plus the use of the generic
combining marks in the Combining Diacritical Marks block
(U+0300..U+036F) is the best approach to polytonic Greek support. Some
fonts do not directly support the display of the pre-composed extended
Greek characters, and most current systems and browsers do a decent job
for Greek using generic fonts. In any case, best display of Greek data—particularly
polytonic Greek data—will result from use of
specially-designed Greek fonts which handle all combinations of Greek
accents optimally.
Q: What is the order of the accents on
ancient Greek letters? Should they come before capital letters? If so,
should they be spacing marks?
A: The order of accents on ancient Greek letters is the
same as in all other cases in Unicode: the accents are represented by
combining marks that appear after the base letter. The canonical order
can be seen either by looking at the polytonic Greek charts in the
Unicode Standard or at the online normalization charts.
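One way to inspect the canonical order without consulting the charts is to decompose a precomposed character and list the resulting code points. The sketch below does this with Python's standard `unicodedata` module for U+1F85 (alpha with dasia, oxia, and ypogegrammeni); the constant names are chosen for this example only.

```python
import unicodedata

ALPHA = "\u03b1"
DASIA = "\u0314"          # COMBINING REVERSED COMMA ABOVE (rough breathing)
OXIA = "\u0301"           # COMBINING ACUTE ACCENT
YPOGEGRAMMENI = "\u0345"  # COMBINING GREEK YPOGEGRAMMENI

precomposed = "\u1f85"
canonical = unicodedata.normalize("NFD", precomposed)

# Print each code point with its canonical combining class (ccc).
for ch in canonical:
    print(f"U+{ord(ch):04X}  ccc={unicodedata.combining(ch):3d}  {unicodedata.name(ch)}")

# Marks with different combining classes are reordered by normalization,
# so a misplaced ypogegrammeni still yields the same canonical sequence.
scrambled = ALPHA + YPOGEGRAMMENI + DASIA + OXIA
same_after_nfd = unicodedata.normalize("NFD", scrambled) == canonical

# Marks with the same combining class are never reordered, so swapping
# the breathing and the accent produces a different, non-equivalent string.
swapped = ALPHA + OXIA + DASIA
differs = unicodedata.normalize("NFD", swapped) != canonical[:3]

print(same_after_nfd, differs)
```

The base letter comes first, followed by the breathing, the accent, and finally the ypogegrammeni.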
Q: Why does Unicode encode a
separate character for the final sigma in Greek? Doesn't that
violate the character-glyph model?
A: There are actually three reasons for this, all of
which conspire to support the same result.
First, there is very extensive legacy practice for handling Greek
characters. And in most of the major Greek character encodings, a
character for the final sigma and a character for the non-final sigma
are distinguished. This includes IBM Code Pages 423, 851, and 869,
Windows Code Page 1253, the HP Greek8 code page, ISO 8859-7, and the
Macintosh Greek code page. Ignoring this legacy and failing to encode a
separate lowercase final sigma and non-final sigma would just have
resulted in major interoperability issues for Unicode and all
preexisting Greek data in those character
encodings.
Second, the usability of a rendering model involving
positional alternate glyphs for characters depends in part
on the distribution and regularity of those forms in each particular
script. The Arabic script is at one end of this continuum, since it is a
cursive script, with predictable glyph shape variations for
every character based on word position; such a script fits
naturally into a processing model which has a basic character for each
"letter", and then dynamically picks presentation forms (or even
ligatures) based on positional analysis. Greek, on the other hand, is a
non-cursive script, and in modern usage, at least, has basically just
the single positional variant form, for sigma. In the latter case,
burdening the rendering model with positional variant analysis is a bad
engineering tradeoff, just to get the two sigmas to be represented by a
single code. It is easier to simply equate the two sigma codes for
operations which are concerned with word content, for example.
Finally, a detailed analysis of Greek corpora and the
usage of final sigma and non-final sigma makes it clear
that no simple positional context rule would cover all
the cases. The rule is actually rather complex and has
lots of exceptions, for abbreviations and other special
cases. That the "rule", if indeed there is a single rule,
is so complex indicates 1) that it would be difficult
to implement, and would probably lead to nagging inconsistencies between
implementations, and 2) that the long history of final sigma and
non-final sigma as character entities (encoded or not) has resulted in
them starting to accrue some independent "characterhood", enabling
people to think of uses for them outside their canonical positions.
Taken all together this was an easy call: Unicode should
(and does) have a separate character code for the Greek
final sigma and the non-final sigma.
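Equating the two sigma codes for operations concerned with word content can be sketched with Unicode case folding, which maps final sigma (U+03C2) to non-final sigma (U+03C3). The example below relies on Python's built-in `str.casefold`; the helper name `word_key` is hypothetical, and case folding is only one possible way to equate the two characters.

```python
def word_key(word: str) -> str:
    """Fold case so that final and non-final sigma compare equal."""
    return word.casefold()


# The same word ending in final sigma (U+03C2) and in non-final sigma (U+03C3).
with_final = "\u03bb\u03cc\u03b3\u03bf\u03c2"
with_nonfinal = "\u03bb\u03cc\u03b3\u03bf\u03c3"

print(with_final == with_nonfinal)                      # distinct code points
print(word_key(with_final) == word_key(with_nonfinal))  # equal after folding

# Both lowercase sigmas share the single capital sigma (U+03A3).
print("\u03c2".upper() == "\u03c3".upper() == "\u03a3")
```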
Q: How do I represent a mute iota?
A: In Greek, the vowels α η ω can be
followed by a mute iota. In those cases, the iota is written in smaller
size, under the letter: ᾳ ῃ ῳ , and it is represented
using U+0345 COMBINING GREEK YPOGEGRAMMENI.
In initial capitalization and in all-caps words, one can
find a wide range of graphic presentations of the mute iota:
[Figure: seven graphic presentations of the mute iota, numbered 1 through 7]
The proper sequence of characters to use depends on the graphic
presentation you want to achieve:
for 1-3, use U+0345 COMBINING GREEK YPOGEGRAMMENI
for 4, use U+03B9 GREEK SMALL LETTER IOTA
for 5-7, use U+0399 GREEK CAPITAL LETTER IOTA (may be styled in small caps)
Conversely, rendering systems usually render a mute iota represented by U+0345
COMBINING GREEK YPOGEGRAMMENI as one of 1-3,
render a mute iota represented by U+03B9 GREEK SMALL LETTER IOTA as 4 and render a mute iota represented by U+0399
GREEK CAPITAL LETTER IOTA as one of 5-7.
However, it is perfectly acceptable for a rendering system to produce any of the
graphic presentations of mute iota from any of the coded character
representations, much as it is acceptable for a rendering system to display
lowercase text in small caps.
Note that this has implications for case conversion. In
particular, U+0345 contains information that can be lost when
converting to uppercase. This is not unusual with case mappings:
converting "McGowan" or "vedereLa" to uppercase also loses information.
[EM]
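The information loss on uppercasing can be observed with the default case mappings built into Python's `str.upper` and `str.lower`. This is only a sketch of one implementation's behavior; other libraries, or locale-tailored mappings, may behave differently.

```python
# Alpha followed by U+0345 COMBINING GREEK YPOGEGRAMMENI (a mute iota).
lower = "\u03b1\u0345"

# Default uppercasing turns the combining iota into a full capital iota.
upper = lower.upper()

# Lowercasing again yields an ordinary small iota, not U+0345,
# so the fact that the iota was mute is lost in the round trip.
round_trip = upper.lower()

for label, text in (("lower", lower), ("upper", upper), ("round trip", round_trip)):
    print(label, [f"U+{ord(ch):04X}" for ch in text])

# The precomposed form U+1FB3 uppercases to the same two capital letters.
precomposed_upper = "\u1fb3".upper()
```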
Q: Where can I find a detailed, scholarly analysis of all
the problems related to the Greek script and Unicode?
A:
Greek Unicode Issues has more than you will probably ever need to
know about all the Greek encoding issues related to Unicode. It also has
links to other sites dealing with Greek.