Re: Mixing languages on a Web site

Date: Fri Jun 30 2000 - 17:18:31 EDT

On 06/30/2000 01:27:18 PM <> wrote:

>Just read your post to the Unicode list. I'm wondering if your site has
>Unicode sample texts available (I'm looking for just about every major
>script/language). The texts don't have to be long... but I'd like stuff
>than one or two sentences (maybe a couple paragraphs would be great). Any
>pointers would be much appreciated.

Not yet, really, at least probably not of the type and in a form that
you're looking for. There is some data but it's either in PDF, or it's
probably limited in its character repertoire to what is found in European
languages. Of the latter, I don't know whether any is encoded in UTF-8. At
any rate, try the following links:

Some of our field offices have been working on getting content ready to
publish on the web, but I'm not sure how much and of what sort, and it may
be that a lot of it will be in PDF for now.

I expect we will have a lot more linguistic data from a wide variety of
languages in the future, but this will take some time. With regard to
character sets/encodings, most of our researchers have, in the past, worked
with custom character sets/encodings where commonly available standards
like cp1252 weren't adequate - linguists everywhere have had to do that, so
most existing data isn't yet in Unicode. As an organisation, though, we're
committed to Unicode, and those of us in our International offices working
on technology solutions for the researchers we support are promoting the use
of Unicode as linguistic software that supports Unicode becomes available.
(We anticipate our first Unicode-enabled language software products will be
released late this year or early next year.) So Unicode-encoded data from
vernacular languages will start to become more common over the next several
years. I also expect that SIL will be getting involved in cooperative
efforts with other major linguistic agencies to start building online
archives of linguistic data, and that will likely build heavily on XML and

One key issue in putting data from hundreds of languages on the web is
fonts and rendering support for complex scripts (which includes IPA and
Roman with diacritics). There is also the issue that some minority
languages use characters that are not yet part of Unicode, or they may use
characters in Unicode but with script behaviour that's slightly different
from what occurs in the more commonly known languages (e.g. different glyph
shapes or different ligatures and ligation rules). It will just take some
time to cross all these bridges.
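
As a small illustration of why even Roman text with diacritics depends on
rendering support, here is a minimal Python sketch using the standard
unicodedata module. The two sample strings are hypothetical examples chosen
for illustration, not data from any SIL corpus. The first has a precomposed
form in Unicode; the second (an IPA-style open e with a tilde) does not, so
it always remains a base letter plus a combining mark that the font has to
position.

```python
import unicodedata


def describe(text):
    """Return a 'U+XXXX NAME' string for each code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Hypothetical sample 1: a letter with two diacritics that has a
# precomposed code point (U+1EBF).
precomposed = "\u1EBF"
decomposed = unicodedata.normalize("NFD", precomposed)

# Hypothetical sample 2: open e plus combining tilde. Unicode has no
# precomposed character for this pair, so NFC leaves two code points.
open_e_tilde = "\u025B\u0303"
composed_attempt = unicodedata.normalize("NFC", open_e_tilde)

if __name__ == "__main__":
    print(describe(precomposed))
    print(describe(decomposed))
    print(describe(composed_attempt))
```

In the second case the rendering system, not the encoding, is responsible
for placing the tilde correctly over the base letter.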

We do have access to electronic corpora of texts in literally hundreds of
minority languages. We know that there would be a lot of interest in those
being made available, and we want to start making them available. With
personnel resources already stretched and some technical issues still to be
worked out, this will take longer than we wish it would.

- Peter

Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <>
