From: Murray Sargent (email@example.com)
Date: Tue May 08 2007 - 23:43:30 CDT
Choosing fonts for arbitrary Unicode plain text is a very difficult problem to handle well. Blocks are not generally well suited to this endeavor, as Mark explains. Many of the useful heuristics are context sensitive. You’re better off starting by dividing Unicode’s defined character codes into various ranges that map to writing systems, which are not necessarily scripts. Code pages describe writing systems, but code pages only span a portion of Unicode, so you need to use a generalization of the code page. Note that assigning a writing system is a lot easier than assigning a language. Fonts tend to deal with writing systems more than with scripts, unless the writing systems are handled by single scripts. Even if a font claims to cover a writing system, you have to deal with the fact that the font may fail to have glyphs for the more obscure characters, e.g., extended Cyrillic characters. Neutral characters such as blanks, digits, and much punctuation need to be assigned by context. When you’ve assigned a run of text to a writing system, you can choose an appropriate font. You still have to deal with things like PANOSE: do you want serif or sans serif, etc. Without such aesthetic guidelines, you can end up with a ransom-note look to multilingual text. And there are many other considerations…
The bottom line is that there’s a lot of added value to good font binding of Unicode plain text. The RichEdit editor that I work on has over 1000 lines of pretty efficient C++ to deal with this and the treatment could still be significantly improved. Wish I had a simple answer.
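The run-assignment step described above can be sketched roughly as follows. This is only a minimal illustration, not RichEdit's method: the range table is a tiny, hypothetical stand-in for a real writing-system table, and the rule for neutrals (inherit from the preceding strong character, otherwise from the following one) is one simple assumption among many possible context heuristics.

```python
# Hypothetical, deliberately incomplete table: (first, last, writing system).
# A real table would be far larger and more carefully drawn.
RANGES = [
    (0x0041, 0x005A, "latin"),
    (0x0061, 0x007A, "latin"),
    (0x00C0, 0x00D6, "latin"),
    (0x00D8, 0x00F6, "latin"),
    (0x00F8, 0x024F, "latin"),
    (0x0370, 0x03FF, "greek"),     # simplification: this range also holds Coptic letters
    (0x0400, 0x052F, "cyrillic"),
]


def writing_system(ch):
    """Return the writing system for a character, or None if it is neutral/unknown."""
    cp = ord(ch)
    for first, last, name in RANGES:
        if first <= cp <= last:
            return name
    return None


def assign_runs(text, default="latin"):
    """Split text into (writing_system, substring) runs, resolving neutrals by context."""
    labels = [writing_system(ch) for ch in text]

    # Forward pass: a neutral inherits the label of the preceding strong character.
    prev = None
    for i, label in enumerate(labels):
        if label is None:
            labels[i] = prev
        else:
            prev = label

    # Backward pass: leading neutrals take the first strong label that follows,
    # or the default if the text contains no strong characters at all.
    following = default
    for i in range(len(labels) - 1, -1, -1):
        if labels[i] is None:
            labels[i] = following
        else:
            following = labels[i]

    # Merge adjacent characters that share a label into runs.
    runs = []
    for ch, label in zip(text, labels):
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1] + ch)
        else:
            runs.append((label, ch))
    return runs


if __name__ == "__main__":
    for system, run in assign_runs("Hello мир 123!"):
        print(system, repr(run))
```

Each run can then be handed to whatever font is chosen for that writing system; the choice of font, and the check that it really has glyphs for every character in the run, is a separate step that this sketch does not attempt.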
From: firstname.lastname@example.org [mailto:email@example.com] On Behalf Of Don Osborn
Sent: Tuesday, May 08, 2007 7:25 PM
To: 'Mark Davis'
Cc: 'Kenneth Whistler'; firstname.lastname@example.org; email@example.com
Subject: RE: Ranges/blocks ; font lookup by range
Thanks Mark, I hear (read) what you’re saying. However there is still another step of finding fonts for those characters, and presently font info is given with reference to the blocks (e.g., at http://www.alanwood.net/unicode/fonts.html). As you note, there is no guarantee that a font listed with the block has everything in it, but until we have a database to look up fonts by a set of characters, blocks are at least a guide.
There are still people hacking 8-bit fonts to get the added characters for such and such language. Part of public education about Unicode, I think, is to explain not only that that is not necessary, but how to find Unicode fonts for particular needs.
You’re right, concerning multiple blocks, esp. for some Latin-based orthographies:
* there are a number of lower/upper case Latin character pairs located across blocks - with the lower case being in IPA Extensions, the “double shift” block (filed under Phonetic Symbols http://www.unicode.org/charts/symbols.html#PhoneticSymbols, but with many characters also serving in standard Latin orthographies for some languages)
* some orthographies use characters across maybe 4 blocks in addition to the basic Latin ones.
Then there are needs in multilingual contexts where by the time you count up all the character needs it’s almost simpler to ask for complete blocks or the whole range(s) of Latin blocks (like Nigeria). But this gets to other issues and ultimately, if MS’s direction with font expansion is any indication, we may have more fonts with more complete repertoires of extended Latin (and other scripts) and this whole issue won’t be as much of a problem.
For the moment, however, my main concern is explaining font selection (among other basics) to a group of translators and writers focusing on producing some documents and web content in main languages of Senegal and Gambia, and I may have to resort to blocks and ranges to give some sense of “structure” (demystifying Unicode or something).
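The kind of lookup wished for above, finding fonts by a set of required characters instead of by block, might look something like the sketch below. The font names and their coverage sets are invented purely for illustration; in practice the coverage would have to be read from each font's character map.

```python
# Invented coverage data: font name -> set of code points the font has glyphs for.
COVERAGE = {
    "HypotheticalSans": set(range(0x20, 0x7F)) | set(range(0xA0, 0x180)),
    "HypotheticalSerif": set(range(0x20, 0x7F)) | set(range(0xA0, 0x2B0)),
}


def fonts_covering(required, coverage):
    """Return the sorted names of fonts that cover every character in `required`."""
    needed = {ord(ch) for ch in required}
    return sorted(name for name, cps in coverage.items() if needed <= cps)


def missing_from(required, cps):
    """Return the characters in `required` that a font's coverage set lacks."""
    return sorted(ch for ch in set(required) if ord(ch) not in cps)


if __name__ == "__main__":
    # A few extended-Latin letters of the sort some orthographies need.
    wanted = "ɓɗŋñ"
    print(fonts_covering(wanted, COVERAGE))
    print(missing_from(wanted, COVERAGE["HypotheticalSans"]))
```

Starting from the character list makes the gaps explicit: a font that is listed under the right blocks can still turn out to lack one or two required letters.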
From: firstname.lastname@example.org [mailto:email@example.com] On Behalf Of Mark Davis
Sent: Tuesday, May 08, 2007 12:32 PM
To: Don Osborn
Cc: Kenneth Whistler; firstname.lastname@example.org; email@example.com
Subject: Re: Ranges/blocks ; font lookup by range
The important concept for most people is the actual list of characters required for writing a given language. This does not align with the notion of "block" in the Unicode Standard, which is often a matter of historical accident based on when chunks of characters were incorporated. While people made efforts to have blocks be reasonably consistent in content, they don't necessarily correspond to actual usage.
Thus a character list for a language may span multiple blocks, and yet not include all of the characters in any single block. I think you generally just want to avoid using the term "block".
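As a small illustration of that point, the sketch below counts how many blocks a short character list touches. The block table is a hand-copied subset of the Latin-related entries in Blocks.txt, and the sample letters are chosen only as an example of an extended-Latin repertoire, not as the alphabet of any particular language.

```python
# Subset of Blocks.txt, copied by hand for this example: (first, last, block name).
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
]


def block_of(ch):
    """Return the name of the block containing ch, if it is in the table above."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return "(not in table)"


def blocks_used(chars):
    """Count how many distinct characters of the list fall in each block."""
    counts = {}
    for ch in set(chars):
        name = block_of(ch)
        counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    sample = "aAñÑŋŊɓƁ"
    for name, count in sorted(blocks_used(sample).items()):
        print(f"{name}: {count}")
    # The case pair ɓ/Ɓ is split across two blocks.
    print(block_of("ɓ"), "/", block_of("ɓ".upper()))
```

Eight letters touch five blocks here and come nowhere near filling any of them, which is why a character list is a more useful thing to hand a font than a block name.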
On 5/8/07, Don Osborn <firstname.lastname@example.org> wrote:
Thanks Ken for the detailed explanations and all for the info & discussion.
The way I understand it then, if you were talking about a language that uses
some extended Ethiopic/Ge'ez characters, you might say you need a font with
(selected characters in) the "Ethiopic Supplement block" more properly than
"Ethiopic Supplement range" but in the end it's pretty much the same?
IOW, anything named (such as "Latin Extended-B" or "Arabic Supplement") is a
block but any group of characters that is contiguous could be referred to as
a range? So you could refer to a font having all Latin blocks or ranges?
Sorry this is tedious, but in introducing the concepts to users who don't
necessarily need technical precision but do need to get how the system is
organized, one wants clarity and simplicity but not inaccuracy in terms. One
thing a user encounters in practice that doesn't look like a "block" is the
window you get in Word or Write when doing insert symbol. That is, the
character groups almost inevitably start & stop in mid "row" (in the
conventional, not 10646 sense) for various reasons (size of window; font
that has only selected characters from a block). This is not a complaint in
any way - just thinking out loud about one issue among many for a
Anyway, this gives me a better perspective on the terminology, so thanks
> -----Original Message-----
> From: firstname.lastname@example.org [mailto:email@example.com] On
> Behalf Of Kenneth Whistler
> Sent: Monday, May 07, 2007 7:46 PM
> To: firstname.lastname@example.org
> Cc: firstname.lastname@example.org; firstname.lastname@example.org
> Subject: Re: Ranges/blocks ; font lookup by range
> > > 1) Is "character range" or "character block" the preferred term
> > In Unicode, a block is a named entity associated with a range of
> > characters that is an integral multiple of 16.
> > That should provide the relation between these two terms. A 256
> > character range inside the Unified CJK Ideographs block, for example, is
> > not a block. (In 10646 it's called a 'row', if aligned on even 256
> > boundaries, but that's not a widely understood term out of context).
> Refining a little bit on Asmus' definitions:
> A Unicode block is a named entity associated with a range of *code points*
> that is an integral multiple of 16.
> You need to specify it that way, because a Unicode block can and often
> does contain unassigned (= reserved) code points, and may, in some
> instances, even contain noncharacters.
> The exact list of blocks is specified normatively in the UCD file,
> Blocks.txt . (Or you can see a comparable listing in Annex A of
> Another way of thinking about it is that a block is a named entity
> consisting of a contiguous range of columns, where a column is
> defined as:
> Column: a range of 16 code points XXX0..XXXF
> "Column" isn't a normative term in either 10646 or the Unicode
> Standard, but is still a useful concept because it is so visible
> in the code charts.
> In the 10646 context, the following terms are also commonly used (these
> are my definitions, not normative definitions in the standard):
> Row: a range of 256 code points XX00..XXFF
> Plane: a range of 64K code points X0000..XFFFF
> For comparison, here are the normative 10646 definitions:
> Row: A subdivision of a plane; of 256 cells.
> Plane: A subdivision of a group; of 256 x 256 cells.
> The Unicode Standard has adopted the term "plane" but
> doesn't make any regular use of the "row" term.
> On the other hand, the Unicode Standard makes use of the term "range"
> in its normal mathematical sense, and it can be used to specify any
> ad hoc listing of code points with a start and a stop point.
> For example, it is perfectly o.k. to talk about a character
> range, U+FFFE..U+10001, even though that particular range happens
> to span a column break, a row break, and a plane break, and also
> incorporates characters (and noncharacters) from two different blocks.
> One of the reasons why the Unicode Standard has generally moved away
> from talking too much about "Unicode character blocks", despite their
> normative status in the standard, is that they do not correlate
> well with script identity. There are a number of instances where
> a script is split across more than one block (Latin, Cyrillic, etc.),
> and there are instances where more than one script is contained within
> a single block (Greek and Coptic).
> People unfamiliar with the standard are likely to expect that if
> one talks about "the Ethiopic block", for example, that:
> A. It will contain all the Ethiopic characters.
> B. It will be a "block" in the sense Doug talked about, i.e.
> a "code page" like chunk of 256 characters 00..FF (or a
> "row" in 10646 parlance).
> C. It contains no characters used by other scripts.
> C happens to be true in this case, but A and B are not, because
> there are also Ethiopic characters in another supplemental block,
> and because the range of the Ethiopic block is 1200..137F.
> Interestingly, because the Ethiopic Supplement block was added
> contiguous to the Ethiopic block, the range of Ethiopic characters
> is a contiguous range, 1200..139F, even though that spans two blocks.
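The column, row and plane definitions quoted above reduce to simple arithmetic on the code point value. The following sketch uses the informal definitions from the message (not the normative 10646 wording); the function names are made up for the example.

```python
def column(cp):
    """Index of the 16-code-point column XXX0..XXXF containing cp."""
    return cp >> 4


def row(cp):
    """Index of the 256-code-point row XX00..XXFF containing cp."""
    return cp >> 8


def plane(cp):
    """Index of the 64K-code-point plane X0000..XFFFF containing cp."""
    return cp >> 16


def is_whole_columns(first, last):
    """True if first..last is a contiguous range of complete columns.

    Every block has this shape, but the shape alone does not make a range
    a block: blocks are the named ranges listed in Blocks.txt.
    """
    return first <= last and first % 16 == 0 and (last + 1) % 16 == 0


def breaks_spanned(first, last):
    """Report which kinds of boundary fall inside the range first..last."""
    return {
        "column": column(first) != column(last),
        "row": row(first) != row(last),
        "plane": plane(first) != plane(last),
    }


if __name__ == "__main__":
    # The ad hoc range from the message crosses every kind of boundary.
    print(breaks_spanned(0xFFFE, 0x10001))
    # The Ethiopic block, 1200..137F, is whole columns but not a single row.
    print(is_whole_columns(0x1200, 0x137F), row(0x1200) == row(0x137F))
    # 1200..139F is also whole columns, yet it spans two named blocks.
    print(is_whole_columns(0x1200, 0x139F))
```

The last case is the reason the term "range" is the safer one in casual use: 1200..139F looks exactly like a block arithmetically, but only the names in Blocks.txt say where one block stops and the next begins.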