Re: Searching data: map countries to scripts

From: Asmus Freytag <asmusf_at_ix.netcom.com>
Date: Mon, 20 Aug 2012 09:24:09 -0700

On 8/20/2012 12:04 AM, Manuel Strehl wrote:
> Thanks for the answer.
>
> It's clear to me that I could map "Hana" and "Kata" to "US" just for
> the sake of having a Japanese minority in the States. Of course, the
> mapping must be sensible in some way, that is, it must explain how the
> mapping is done. I'd be fine, I guess, with having all official languages
> and important historic ones respected (disputable cases, where larger
> minority languages are suppressed, may of course exist).
>
> Basically I'm looking for an n:m chart with ISO-639 on the left and
> ISO-15924 on the right. If the data itself is annotated with "used
> by 0.2% of the population" or "historic", all the better, because
> then I could define my own cut-off limit. If there is only a prose
> explanation of how the data was accumulated, I could judge whether the
> set suits the task.
>
> If there is no such data set whatsoever, I'll be off to scrape
> Wikipedia again, but that is, as I've written, neither an effective nor
> a particularly error-free approach.

There are other sources than Wikipedia.

I think what you are engaging in here is a bit of original research; in
other words, you may be the first to try to put together this particular
data set.

The usual statistics work off the number of "speakers" of languages, not
the script they are written in (few languages are routinely written in
more than one script at the same time, and where they are, the division
is usually by territory).

So you might make a map from language to script first (allowing some 1:n
relations and local differences in that map). Then you can plug in
statistics on language use. There are many sources you can use; for the
US, see http://www.mla.org/census_main/
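
Just to make that concrete, here's a rough Python sketch of the two steps,
working directly off CLDR's supplementalData.xml. The element and attribute
names (languageData, territoryInfo, languagePopulation, populationPercent)
are taken from my reading of the current CLDR supplemental data and may
shift between CLDR versions, so take it as a starting point, not a recipe:

  # Sketch only: derive territory -> script percentages from CLDR's
  # supplementalData.xml (from a CLDR "core" download). Element and
  # attribute names are assumptions based on the current CLDR layout.
  import xml.etree.ElementTree as ET
  from collections import defaultdict

  root = ET.parse("supplementalData.xml").getroot()

  # Step 1: language -> set of scripts (ISO 639 -> ISO 15924), allowing 1:n.
  lang_to_scripts = defaultdict(set)
  for lang in root.iterfind("languageData/language"):
      for script in (lang.get("scripts") or "").split():
          lang_to_scripts[lang.get("type")].add(script)

  # Step 2: for one territory, sum language-use percentages per script.
  def scripts_for(territory, cutoff=0.5):
      node = root.find("territoryInfo/territory[@type='%s']" % territory)
      usage = defaultdict(float)
      if node is None:
          return usage
      for lp in node.iterfind("languagePopulation"):
          pct = float(lp.get("populationPercent", 0))
          base = lp.get("type").split("_")[0]  # drop subtags like zh_Hant
          for script in lang_to_scripts.get(base, ()):
              usage[script] += pct
      # keep only scripts above your own cut-off limit
      return {s: round(p, 2) for s, p in usage.items() if p >= cutoff}

  print(scripts_for("RU"))  # expect Cyrl dominant, plus smaller shares

That gives you roughly the n:m chart with percentages you described, and
you can tighten or loosen the cut-off as you see fit.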

A map would be more interesting if you could find a way to split larger
territories, such as the US, Russia, China, India, etc., into suitable
subdivisions. Notice how the language map for the US shows non-English
languages nicely concentrated along the coast and borders.

A./

>
> Cheers,
> Manuel
>
> 2012/8/20 Asmus Freytag <asmusf_at_ix.netcom.com>:
>> On 8/19/2012 4:05 PM, Manuel Strehl wrote:
>>
>> Hello,
>>
>> I'm looking for a data source that maps countries to the scripts used in
>> them. The target application is a visualization in the context of my
>> codepoints.net site, namely http://codepoints.net/scripts.
>>
>> At the moment I've extracted the preferred scripts from CLDR (e.g., Cyrl
>> for Russia, Latn for Germany, and so on). Then I've added some historic
>> scripts by looking at corresponding Wikipedia articles and did some
>> manual updating. However, this does not yield a really satisfactory result.
>>
>> For example, Russia has only Cyrl associated with it, while, as far as I can
>> tell, at least Latn and Arab should also be mentioned, and perhaps some
>> historic scripts as well.
>>
>> I'd appreciate any pointers to whether and where I could find data sets that
>> would aid me in completing and error-proofing this mapping.
>>
>> Cheers,
>> Manuel
>>
>>
>> Heck, my utility bill in the US has Thai and Chinese characters (for the
>> fine print, not the statement itself). There's one more script (it could be
>> Cyrillic; I don't have one in front of me right now). In some areas of town
>> you'll find a mixture of scripts on shop signs as well.
>>
>> The point is, it's easy to identify a majority script, but getting an accurate
>> handle on "other" scripts is going to be tricky, if not impossible. And it
>> all depends on your arbitrary decision about which other scripts to include
>> and on what basis.
>>
>> A./
>