RE: dotless j

From: Christopher J. Fynn (cfynn@dircon.co.uk)
Date: Mon Jul 05 1999 - 21:44:34 EDT


Peter Constable wrote:
 
> >How do you make a j lose its dot if you do not have a dotless j available?
> >I don't get it. It would seem to make sense to me to bend the rules in this
> >case and have a dotless j even if it is a glyph and not a character used by
> >any language.
 
> The dotless j glyph can be contained in a font without requiring that there
> be an entry in the cmap to access it directly, i.e. without requiring that
> there be a Unicode value associated with it. Just as Unicode doesn't need
> to contain every Arabic contextual form and ligature, every positional
> variant of Thai diacritics, every Devanagari conjunct, every Hangul
> syllable, etc.

One problem is that there are font editors which either insist that you
assign a "Unicode" cmap value to each glyph, or assign one by default when
generating a TrueType font. Some large companies also seem to consistently
assign specific values to particular presentation glyphs in their fonts.
Once such a consistent entry is present, some users will start to adopt it
as a "character" for that glyph. If there are several widely used
assignments for the glyph, you can then end up with several unofficial
character encodings for this entity among different communities of users -
which is just what we are trying to get away from.

If this sort of thing is going to occur (and I think it is more or less
inevitable), won't it cause fewer problems to assign this entity an official
encoding, despite its dubious claim to that status?
 
> >If we really want to convince all programmers to use Unicode, we can hardly
> >insist that they add low-level code to every single program they write to
> >remove the dot from the j by directly manipulating the fonts.

> No more than they all want to write code to render a bunch of other
> scripts. The answer for them is general-purpose rendering code (be it in
> the form of libraries, system calls, whatever) that takes as input Unicode
> characters, a font identifier, and a designation of the intended
> language/writing system (in the general case, the Unicode characters alone
> are not enough) and that returns the appropriate sequence of glyphs (with
> positioning info) and/or draws the glyphs on the appropriate device. (It's
> necessary to identify the font in the input unless some adequate set of
> glyph IDs can be assumed.)
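
For what it's worth, the contract Peter describes - characters, font and
writing system in, positioned glyphs out - can be sketched concretely.
A minimal sketch, assuming a shaping library such as HarfBuzz (via its
Python binding, uharfbuzz) and a placeholder font path:

    # Sketch of the rendering interface described above. Assumptions:
    # uharfbuzz is installed and "SomeFont.ttf" covers Latin.
    import uharfbuzz as hb

    with open("SomeFont.ttf", "rb") as f:
        blob = hb.Blob(f.read())
    font = hb.Font(hb.Face(blob))

    buf = hb.Buffer()
    buf.add_str("j\u0302")         # j + COMBINING CIRCUMFLEX ACCENT
    buf.direction = "ltr"          # the writing-system designation:
    buf.script = "Latn"            #   the Unicode characters alone
    buf.language = "en"            #   are not enough
    hb.shape(font, buf)

    # The output is glyph IDs with positioning, not Unicode characters;
    # a well-built font substitutes its dotless-j glyph here even though
    # that glyph has no cmap entry of its own.
    for info, pos in zip(buf.glyph_infos, buf.glyph_positions):
        print(info.codepoint, pos.x_advance, pos.x_offset, pos.y_offset)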

I think what Michael is pointing out is that it is not so difficult to
convince users of complex scripts, the programmers who write applications
handling those scripts, and the developers of non-Latin fonts to embrace
Unicode in what has until now been the Latin-script-dominated world of
computing. The benefits to this community are clear.

Latin-script users, font designers, and even developers have much less
incentive to adopt systems that support Unicode - especially if they feel it
makes life more difficult for them.
 
> >Wouldn't it be considerably simpler to just add a dotless j to the Unicode
> >standard so that font designers become motivated to include it in the fonts?
 
> That's not the motivation font designers need. Font designers have to, and do
> (perhaps implicitly), design with one or more specific writing systems in mind.
> If they decide to design for a writing system that uses j with various
> diacritics, they will include a dotless j glyph, and/or whatever is needed to
> present the j with those diacritics.

And most of them will assign a "Unicode" cmap value to that glyph; no matter
what the Unicode standard states, so long as they can access the glyph by
that value, many users will start adopting that assignment as a character
representing that entity.
 
(If you want to avoid this, perhaps there should be a statement that glyphs
which are presentation forms of Unicode characters or their combinations
should not have a cmap entry, and that only glyphs for unencoded characters
or symbols should be given values in the user area.)
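
Such a rule would at least be mechanically checkable. A minimal sketch,
again with fontTools and a placeholder path, finds the glyphs which are
reachable only by glyph index, as the rule would require of presentation
forms:

    # Sketch of the suggested policy check: presentation-form glyphs
    # should be reachable only by glyph ID, never through the cmap.
    from fontTools.ttLib import TTFont

    font = TTFont("SomeFont.ttf")
    mapped = set(font.getBestCmap().values())
    unmapped = set(font.getGlyphOrder()) - mapped
    print("glyphs with no cmap entry:", sorted(unmapped))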

> I say that the font designer must have particular writing systems in mind,
> meaning that they shouldn't attempt to create fonts for the general case.
> Aside from the fact that the latter would produce unwieldy fonts (MS's
> Arial Unicode is something like 23MB, and it doesn't include extra
> presentation forms), it may not be possible with current font technology:
> TrueType fonts, at least, can have at most 64K glyphs. If one were to
> design a font for all of Unicode, it would take more than that many glyphs,
> even if planes 1 to 14 are ignored. So, we don't want to motivate designers
> to include dotless j in general - if they add dotless j for you today,
> they'll be asked to add i- and o-width overstriking accents for someone
> else tomorrow, and on and on. If they don't aim for a particular target,
> they'll never hit anything.
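
Incidentally, the 64K ceiling Peter mentions is visible directly in the
font tables: the glyph count is stored as a 16-bit field in 'maxp'. With
fontTools, for example (placeholder path again):

    # The glyph count of a TrueType font is a 16-bit integer in the
    # 'maxp' table, hence the 64K limit mentioned above.
    from fontTools.ttLib import TTFont

    font = TTFont("SomeFont.ttf")
    print(font["maxp"].numGlyphs)    # uint16, so at most 65,535 glyphs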

> [from a subsequent message also by Adam:]
 
> >I doubt I will sound more convincing to them when I tell them they need to
> >parse the fonts directly and remove dots from certain letters just because
> >it is cast in stone that Unicode only deals with characters, never with
> >glyphs...
> >
> >If we don't KISS, many programmers will refuse to embrace us.
>
> I wholly agree. See my second comment above. Application programmers will
> need to make some adjustments to deal with Unicode, but it can be kept to a
> minimum if they are provided with appropriate enabling technologies. (Of
> course, some programmers need to come to the rescue of their peers and deal
> with the latter.)

If these "enabling technologies" are part of huge C++ libraries which add
considerable overhead to an application or if they involve too
many arcane, poorly documented and difficult to debug API calls many
developers will choose not to use them (and ignore support for many of
the scripts in Unicode) when writing things like text editors, email
clients, database front ends, and so on.
 
- Chris


