The Unicode Consortium Discussion Forum


All times are UTC - 6 hours [ DST ]




 Post subject: languages versus writing
PostPosted: Wed Jun 29, 2011 11:58 am 

Joined: Wed Jun 29, 2011 3:32 am
Posts: 8
Unicode is about writing: it enables a writing system to be represented in a standard form, with fixed codes for the basic character repertoire of the language plus rules on how these characters should be combined and used within software claiming compliance with the standard. This seemed wonderful until I had worked for a few years in Nepal and encountered a dispute about how to write a small language, Lohorung, with only 1,207 speakers and no written form. One group favoured Devanagari, the locally dominant system, with suitable extensions if needed. Another group favoured Roman (Latin). I wondered, why not both? I suggested to a colleague, an expert on writing systems, that you could write a language in any writing system. He disagreed.

But you can! Many languages have switched their writing system because of perceived (or imposed) advantages. Turkish switched from Arabic to Roman in 1928, the Central Asian states switched from Arabic to Cyrillic under Russian occupation, and many are now shifting to Roman. In the cross-border communities between India and Pakistan, those on the Pakistani side write in an extension of Arabic, while those on the Indian side write in Brahmic scripts - Devanagari for Sindhi, Gurmukhi for Punjabi.

So what should the Lohorung do? UNESCO advises, for a language previously unwritten, that you start by determining its phonemic inventory, the set of meaningfully distinct sounds, and then decide how to write these in the chosen system. However, why not just encode the phonemes, and then arrange OpenType fonts to render those phonemes in the writing systems of choice? We could do this for Lohorung, deciding on a block in the Private Use Area and building fonts to render these codes in Devanagari and in Roman. I have made some pilot experiments to do this by rendering text encoded in the Devanagari block into Roman using FontForge, following the Library of Congress scheme for romanizing Devanagari. I have not gone the whole way, as it needs algorithms akin to those used for text-to-speech, but I am satisfied that it is possible using OpenType technology.
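As a rough illustration of the idea above, here is a minimal Python sketch that stores phonemes as Private Use Area codes and renders them either in Roman or in Devanagari. Everything in it is an assumption made for illustration: the code assignments (U+E000 onward), the five-phoneme inventory and the spellings are invented and are not Lohorung data, and the rendering is done in ordinary code rather than in a font's substitution rules. It also shows why the Devanagari side needs context rules (inherent vowel, vowel signs, virama) while the Roman side is plain concatenation.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA, suppresses the inherent vowel

# Hypothetical PUA code -> (roman spelling, Devanagari consonant letter)
CONSONANTS = {
    "\uE000": ("k", "\u0915"),  # KA
    "\uE001": ("m", "\u092E"),  # MA
    "\uE002": ("l", "\u0932"),  # LA
}

# Hypothetical PUA code -> (roman, independent vowel letter, dependent vowel sign)
VOWELS = {
    "\uE010": ("a", "\u0905", ""),        # inherent vowel: no sign after a consonant
    "\uE011": ("i", "\u0907", "\u093F"),  # vowel sign I after a consonant
}


def render_roman(codes):
    """Roman rendering: one spelling per phoneme, simply concatenated."""
    return "".join((CONSONANTS.get(c) or VOWELS[c])[0] for c in codes)


def render_devanagari(codes):
    """Devanagari rendering: the shape of a vowel depends on what precedes it."""
    out = []
    prev_is_consonant = False
    for c in codes:
        if c in CONSONANTS:
            if prev_is_consonant:
                out.append(VIRAMA)  # consonant cluster: kill the inherent vowel
            out.append(CONSONANTS[c][1])
            prev_is_consonant = True
        else:
            _, independent, sign = VOWELS[c]
            out.append(sign if prev_is_consonant else independent)
            prev_is_consonant = False
    if prev_is_consonant:
        out.append(VIRAMA)  # word-final consonant
    return "".join(out)


# The phoneme sequence k-a-m-i-l, stored once and rendered twice.
text = "\uE000\uE010\uE001\uE011\uE002"
roman = render_roman(text)
devanagari = render_devanagari(text)
print(roman, devanagari)
```

The stored text is the same in both cases; only the rendering step differs, which is the role a font would play in the scheme described above.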

Taking this approach resolves problems with other languages in Nepal. The Limbu language is written in both Devanagari and Sirijanga (the mis-named Limbu block) and needs easy ways of switching between these, as easy as switching a font. The Newar language is written in a number of distinct styles, and experts have tried to encode each style in distinct blocks, without realising that there is only one language here with a number of distinct fonts. There are many other examples.

While my cases come from Nepal, there are a number of general issues here. A question I posed at a Unicode conference was what is the difference between a code block (a writing system) and a font (a style of writing) - is it like the difference between a language and a dialect? One has an army, the other does not.

Should we encode phonemic inventories? Should these be within the Unicode code space and standard, or in some other code space and standard? Do we identify languages with the phonemic inventory, or is it in the space between the phonemic inventory and the writing system(s), in the mappings between these with which we might associate the orthography?

So many questions, does anybody have any answers?


 Post subject: Re: languages versus writing
PostPosted: Sat Jul 02, 2011 12:10 pm 
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
Usually Unicode picks up at the point where your task is completed, that is, after the writing system is defined (and in use).

I wouldn't be able to advise you, except to point out a few obvious things. If you use only characters already defined in the standard, you can start using Unicode-based tools right away.

However, if you create a mix of characters from different scripts, or decide that some characters should have different layout behavior in the context of your new writing system, that would likely make adoption of this writing system more difficult, because even if software supported the characters as such, it would do the wrong things for the new writing system.

Worse, some things just cannot be supported easily or at all. One recent orthography uses the character @ as a letter. You can imagine what headaches that causes for software dealing with e-mail. It is practically impossible to support such a system in conjunction with the other writing systems of the world.
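To make the @ problem concrete, here is a small Python sketch. The word "k@la" is invented for illustration and is not taken from the orthography mentioned above. It shows that the Unicode character properties classify @ as punctuation, so ordinary word-based processing breaks such a word apart.

```python
import re
import unicodedata

# A made-up word spelled with "@" as a letter (illustration only).
word = "k@la"

# Unicode gives "@" the general category Po (other punctuation), not a letter category.
category = unicodedata.category("@")
is_letter = "@".isalpha()

# A typical word tokenizer built on \w therefore splits the word in two.
tokens = re.findall(r"\w+", word)

print(category, is_letter, tokens)
```

Any software that relies on these properties (word selection, search, spell checking, e-mail address detection) would need special-casing for that one orthography.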


 Post subject: Re: languages versus writing
PostPosted: Sun Jul 03, 2011 9:17 am 

Joined: Wed Jun 29, 2011 3:32 am
Posts: 8
So far I have had 1 reply and 39 views, and also two private messages which I had hoped would be reposted here.

The issue I see is whether people concerned with languages in the computer should move on beyond Unicode, though keeping Unicode and Open Type and keyboard mappings as powerful technologies for continued use. Should we move to encoding the language independently of Unicode?

I posted some examples of languages that are written in several different writing systems; mine were from Nepal, and the private message gave further examples from Kurdish (Arabic and Roman) and from Ethiopia (Ethiopic and Roman). The writing systems are not mixed up, with characters from one system used within another, as asmus’ post seemed to assume. In these many examples of multiple ways of writing a particular language, no particular way of writing is necessarily the definitive reference version. What remains the same between these different representations? It is of course the language.

One question debated by linguists is whether it is the spoken language or the written language that is definitive. Historically it was taken as the written language, but this was overturned by Saussure, though Harris has pointed out that Saussure was not entirely consistent in this. To my mind a language is more abstract, it is manifest in all its representations, it is both the written form and the spoken form.

In my first post I referred to the UNESCO guide for developing writing systems via a phonemic inventory, which I suggested should be directly encoded. This would encode the spoken language, able to be the source for speech generation and, relatively soon we hope, the target for speech input of phonemes. An encoded spoken passage could be converted to an encoded written passage using methods from speech recognition, and in the opposite direction using the methods of text-to-speech generation; both must necessarily use lexicons as well as general rules.

We have many excellent representations of the written language, in Unicode for the actual writing systems, and in dictionaries to complete the orthographies. Now that we have the technologies for doing so, isn’t it time we moved on to the spoken form?

And moved beyond Unicode?


 Post subject: Re: languages versus writing
PostPosted: Sun Jul 03, 2011 9:40 pm 

Joined: Tue Jan 11, 2011 9:26 pm
Posts: 1
Mapping Latin characters onto Devanagari code points in a font would be a bad idea, especially if the users want to use not only Lohorung but also other languages that use the Devanagari script.

What you want to achieve could be achieved through transliteration tools, assuming it is possible to handle round-trip conversions between the Latin and Devanagari representations of Lohorung.
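Whether round-trip conversion is possible depends on the two orthographies. The Python sketch below uses an invented four-letter table (not an actual Lohorung orthography) and treats letters as isolated units, ignoring the inherent vowel and virama. It shows one way to test a table for round-trip safety, and a typical failure: a Latin digraph that collides with a sequence of two single letters.

```python
# Hypothetical letter-level table: Devanagari letter -> Latin spelling.
DEVA_TO_LATIN = {
    "\u0915": "k",   # KA
    "\u0916": "kh",  # KHA
    "\u0939": "h",   # HA
    "\u092E": "m",   # MA
}
LATIN_TO_DEVA = {latin: deva for deva, latin in DEVA_TO_LATIN.items()}
# Try longer Latin spellings first so that "kh" wins over "k" followed by "h".
LATIN_KEYS = sorted(LATIN_TO_DEVA, key=len, reverse=True)


def to_latin(text):
    """Devanagari -> Latin, letter by letter; unknown characters pass through."""
    return "".join(DEVA_TO_LATIN.get(ch, ch) for ch in text)


def to_devanagari(text):
    """Latin -> Devanagari using greedy longest-match on the Latin spellings."""
    out = []
    i = 0
    while i < len(text):
        for key in LATIN_KEYS:
            if text.startswith(key, i):
                out.append(LATIN_TO_DEVA[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # no table entry: copy the character unchanged
            i += 1
    return "".join(out)


def round_trips(deva_text):
    """True if Devanagari -> Latin -> Devanagari returns the original text."""
    return to_devanagari(to_latin(deva_text)) == deva_text


ok = round_trips("\u0916\u092E")         # KHA + MA -> "khm" -> KHA + MA
ambiguous = round_trips("\u0915\u0939")  # KA + HA -> "kh" -> KHA (information lost)
print(ok, ambiguous)
```

A real tool would need either an unambiguous Latin orthography or a lexicon to resolve cases like the second one.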

It is difficult to comment without having a more detailed understanding of the Latin and Devanagari orthographies for Lohorung.

Andrew

integrationist wrote:
So what should the Lohorung do? UNESCO advises, for a language previously unwritten, that you start by determining its phonemic inventory, the set of meaningfully distinct sounds, and then decide how to write these in the chosen system. However, why not just encode the phonemes, and then arrange OpenType fonts to render those phonemes in the writing systems of choice? We could do this for Lohorung, deciding on a block in the Private Use Area and building fonts to render these codes in Devanagari and in Roman. I have made some pilot experiments to do this by rendering text encoded in the Devanagari block into Roman using FontForge, following the Library of Congress scheme for romanizing Devanagari. I have not gone the whole way, as it needs algorithms akin to those used for text-to-speech, but I am satisfied that it is possible using OpenType technology.

Taking this approach resolves problems with other languages in Nepal. The Limbu language is written in both Devanagari and Sirijanga (the mis-named Limbu block) and needs easy ways of switching between these, as easy as switching a font. The Newar language is written in a number of distinct styles, and experts have tried to encode each style in distinct blocks, without realising that there is only one language here with a number of distinct fonts. There are many other examples.



 Post subject: Re: languages versus writing
PostPosted: Thu Jul 14, 2011 4:14 pm 

Joined: Wed Jun 29, 2011 3:32 am
Posts: 8
Speaking and writing – two parallel universes

See attached file - I needed figures to support my argument.


 Post subject: Re: languages versus writing
PostPosted: Thu Jul 14, 2011 5:15 pm 
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
Don't see any attachments.


 Post subject: Re: languages versus writing
PostPosted: Fri Jul 15, 2011 1:19 am 

Joined: Wed Jun 29, 2011 3:32 am
Posts: 8
Yes, it's true, no files. The upload file facility did not work for me, and without that working I may just have to give up. However, I will have another try. I thought about setting out what I had said in words and, encouraged by asmus having had a look, I will even do that.


 Post subject: Re: languages versus writing
PostPosted: Fri Jul 15, 2011 2:08 am 
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
The file size limit for attachments is a rather paltry 40KB. If you have a larger file, the way to do it is to post it somewhere else and then add a link here.


 Post subject: Re: languages versus writing
PostPosted: Fri Jul 15, 2011 4:22 am 

Joined: Wed Jun 29, 2011 3:32 am
Posts: 8
Speaking and writing – two parallel universes

Failed again: it did not accept PDFs and crashed on docx. Then I saw asmus' message that attachments are limited to 40KB, very paltry indeed given that my PDF is 86 KB, and it really is a very small document.

I will rewrite it as pure text, but not now, I'm off to visit Sorrento. The figures I produced will be very useful at a conference I am attending next week.


 Post subject: Re: languages versus writing
PostPosted: Fri Jul 15, 2011 10:31 am 
Engineer

Joined: Mon Nov 30, 2009 7:14 pm
Posts: 40
Location: Earth
This file-upload business is a side issue, but... The file size limit is specifically set low and/or disallows types because it's really only for little images to accompany discussions. We don't want this board to become a place where people store documents.

The best approach if you need to upload a file or proposal for discussion is to use one of the myriad "drop" services on the web where you can upload documents. Perhaps try Google Docs. It's easy and worked for me...

Having said that, if you really can't find anyplace to store your document, e-mail it to me and I can put it somewhere for the purposes of this discussion.


 Post subject: Re: languages versus writing
PostPosted: Sat Jul 16, 2011 3:31 am 

Joined: Wed Jun 29, 2011 3:32 am
Posts: 8
Here we go again, this time in plain text, replacing my data flow diagrams (DFDs) by the equivalent scenarios.

Speaking and writing – two parallel universes

Most people reading this thread will think using a language in the computer involves writing, with a scenario rather like the one illustrated below for a particular language written in a particular writing system encoded in Unicode. In former times we would write with a pen or pencil on paper, and file that for later reference or dispatch it in an envelope to its intended reader. Today we have computing technology.

Scenario of written communication
Author uses a keyboard or keypad to type in language L in writing system G.
Keymap software converts key press codes to Unicodes for writing system G.
Word processor used to add to and edit document, with orthography for L written in G guided by spell checkers and grammar checkers.
Author stores encoded written document for later access or transmits it to some other location.
...
Reader accesses the stored encoded document.
Font technology renders it in writing system G.
Reader reads the communication from author.
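The written scenario above can be sketched in a few lines of Python. The keymap here is a made-up three-key example standing in for writing system G, and UTF-8 is assumed as the storage format; the final rendering step is left to the font and is not modelled.

```python
# Hypothetical keymap: key label -> Unicode character of writing system G
# (three Devanagari characters are used here purely as an example).
KEYMAP = {
    "k": "\u0915",  # DEVANAGARI LETTER KA
    "i": "\u093F",  # DEVANAGARI VOWEL SIGN I
    "m": "\u092E",  # DEVANAGARI LETTER MA
}


def type_keys(keys):
    """Keymap step: convert a sequence of key presses to a Unicode string."""
    return "".join(KEYMAP[key] for key in keys)


def store(text):
    """Storage or transmission step: serialise the text as UTF-8 bytes."""
    return text.encode("utf-8")


def retrieve(data):
    """Reader side: decode the stored bytes back into Unicode text."""
    return data.decode("utf-8")


document = type_keys(["k", "i", "m"])
stored = store(document)
received = retrieve(stored)

# What the font technology receives for rendering: a sequence of code points.
code_points = [f"U+{ord(ch):04X}" for ch in received]
print(code_points)  # ['U+0915', 'U+093F', 'U+092E']
```

Note that the vowel sign is stored after the consonant even though it is displayed to its left; reordering for display is part of the rendering step.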

Think of this as one way of people communicating, where information is exchanged entirely in writing, as in SMS messages, blogs, social network software, and academic discussions like this Unicode Forum.

If a language is written in two (or more) writing systems, then there are parallel systems like this for each writing system G1, G2, etc., with possible conversions between them using transliteration software as indicated by Andrewc. In principle the orthography in one writing system could be quite different from that in another system, and simple transliterations might not suffice; some form of dictionary look-up would be necessary.

However there is another parallel universe of speaking that we inhabit in our daily lives, communicating with each other over a range of issues. Speech disappears as soon as it is spoken unless it is specially recorded, which we only occasionally do in facilities like voice mail. Now imagine that in this parallel speech universe you could do similar things to the written universe, storing it for access remotely over space and time, as illustrated below. The author speaks what he/she wishes to communicate. Simple audio recording will not suffice; that is technology of a former age.

Scenario of spoken communication
Author uses a microphone to speak in language L in dialect P.
Speech recognition software recognises the phonemes of dialect P and converts these to the codes for the phonemic inventory P of language L.
Speech editor used to add to and edit spoken document, with orthophony for L spoken in dialect P guided by pronunciation checkers.
Author stores encoded spoken document for later access or transmits it to some other location.
...
Listener accesses the stored encoded spoken document.
Speech generation technology renders it in dialect P of language L.
Listener listens to the communication from Author.

The phonemes of language L dialect P will have been previously identified following standard phonological methods, and then given codes equivalent to the Unicodes for graphemes. This process is like the process suggested by UNESCO and SIL for creating writing systems of unwritten languages, only here we stop at the point of producing the phonemic inventory. This leads to a compact representation of the spoken document using those phoneme codes, as compact as the equivalent in writing, in which the individual voice of the speaker has been lost, only the semantically significant linguistic utterances remain. To recover this document as speech, to render it orally, a simple diphone system would be adequate, as used in concatenative speech synthesis; but we might need a more sophisticated look up to select the correct allophone from the phoneme class depending upon a particular context.
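The allophone look-up mentioned at the end of the paragraph above could be sketched as follows in Python. The single rule used here (a /p/ at the start of a word is aspirated, elsewhere it is plain) is a deliberately simplified illustration loosely modelled on English, not a description of any Nepalese language, and the rule table format is my own assumption.

```python
# Context rules per phoneme: (test on previous and next phoneme, allophone).
# The first matching rule wins; a phoneme without a matching rule is left unchanged.
ALLOPHONE_RULES = {
    "p": [
        (lambda prev, nxt: prev is None, "pʰ"),  # word-initial: aspirated
    ],
}


def select_allophones(phonemes):
    """Map a list of phoneme codes to the allophones to be synthesised."""
    result = []
    for i, phoneme in enumerate(phonemes):
        prev = phonemes[i - 1] if i > 0 else None
        nxt = phonemes[i + 1] if i + 1 < len(phonemes) else None
        chosen = phoneme
        for matches, allophone in ALLOPHONE_RULES.get(phoneme, []):
            if matches(prev, nxt):
                chosen = allophone
                break
        result.append(chosen)
    return result


initial = select_allophones(["p", "ɪ", "n"])       # /p/ at word start
medial = select_allophones(["s", "p", "ɪ", "n"])   # /p/ after /s/
print(initial, medial)
```

The stored document keeps only the phoneme codes; the choice of allophone is made at rendering time, much as glyph selection is made by a font.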

In the written communication universe we could include the ability to speak a written document using standard TTS methods, using a range of voices of different quality or even different dialect. In the parallel speech universe we could add a small ability to render (transcribe) the spoken document graphically, which can be done directly from the phonemic encoding using font technologies, with different fonts rendering in different writing systems. This introduces some ability to move between universes.

Languages that have a recently produced written form will have their spoken and written forms completely identical, which applies to most languages of Nepal, such as Gurung, Magar and Tamang. But languages that have been written for many centuries will typically have written and spoken forms that have diverged significantly, as has happened with English and French. Even for South Asian languages that proudly claim to be phonetic, there are differences that need to be taken into account.

These considerations suggest that we can look for full movement between the parallel spoken and written universes – I will describe the processes for this later if anybody is interested.

What I am hoping from this discussion is to hear from the Unicode gurus whether they have considered these issues, and if so what the conclusions were. Can Unicode handle spoken languages?


 Post subject: Re: languages versus writing
PostPosted: Fri Aug 05, 2011 4:34 pm 
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
Since nobody else has replied:

Unicode does not consider spoken languages, because it is a codification of symbols and shapes used in writing.

How the writing relates to languages is a matter for orthographies.


 Post subject: Re: languages versus writing
PostPosted: Sat Aug 06, 2011 11:42 am 

Joined: Wed Jun 29, 2011 3:32 am
Posts: 8
Thanks Asmus, I was beginning to think that the Unicode community did not care about languages, viewing Unicode as a sophisticated clip-art library. I am partially reassured.

I am concerned about unwritten languages, and if your contention is correct, then for a language to be handled by the computer using Unicode technologies, it must first of all be given a writing system and orthography, and then if needed seek an encoding for any new characters created. If it wants to focus on the encoding of its phonetic inventory, then it must seek some other as yet unformulated standard.

However, you cannot really look at a writing system (script) divorced from the living language(s) it is used to write. Let me consider two examples from Nepal. First, the Limbu language that I mentioned in my first post: this is spoken in eastern Nepal and has a traditional writing system, Sirijanga, created a couple of hundred years ago and used for recording traditional cultural myths and for Buddhist religious texts. There was no real active use of written Limbu within the community until universal education was started in the 1950s, when people learnt to write the Nepali language in Devanagari, and soon also learned to write Limbu in Devanagari, to which an extra symbol was added for the glottal stop at code point 097D. However, with the relaxing of language policies in 1990, members of the Limbu language community then also wanted to be able to write in Sirijanga, and an encoding for this was created by Boyd Michailovsky and Michael Everson in consultation with members of the community, and encoded in Unicode as Limbu, 1900–194F. The proposal gives phonetic arguments to resolve some of the uncertainties, and ends up with a set of characters in one-to-one correspondence with Devanagari characters. Hence my observation that two encodings for the Limbu language are unnecessary and even confusing; a single encoding using fonts to render a text in either Devanagari or Sirijanga would be more appropriate. I would suggest staying with the Limbu code block.
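If the correspondence really is one-to-one, moving text between the two blocks is a simple table look-up. The Python sketch below builds a partial table from Unicode character names for six consonants whose names appear in both blocks; the choice of these six is mine, the full correspondence from the proposal is not reproduced, and vowel signs and final consonants are ignored.

```python
import unicodedata

# Consonant names that exist in both the Limbu and the Devanagari block.
# This is a partial, illustrative list, not the full correspondence.
SHARED_NAMES = ["KA", "KHA", "GA", "GHA", "NGA", "MA"]

LIMBU_TO_DEVA = {
    unicodedata.lookup(f"LIMBU LETTER {name}"):
        unicodedata.lookup(f"DEVANAGARI LETTER {name}")
    for name in SHARED_NAMES
}
DEVA_TO_LIMBU = {deva: limbu for limbu, deva in LIMBU_TO_DEVA.items()}


def limbu_to_devanagari(text):
    """Replace each mapped Limbu letter by its Devanagari counterpart."""
    return text.translate(str.maketrans(LIMBU_TO_DEVA))


def devanagari_to_limbu(text):
    """Replace each mapped Devanagari letter by its Limbu counterpart."""
    return text.translate(str.maketrans(DEVA_TO_LIMBU))


sample = unicodedata.lookup("LIMBU LETTER KA") + unicodedata.lookup("LIMBU LETTER MA")
converted = limbu_to_devanagari(sample)
restored = devanagari_to_limbu(converted)
print([f"U+{ord(ch):04X}" for ch in converted])
```

This converts the stored text, which is a different design from the font-switching approach argued for above; the table itself would be the same in either case.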

Now for my second example, not referred to in my previous postings. This is the Newar language, which has a rich written tradition extending back at least a thousand years. Over the years the community has developed a number of distinct scripts or writing styles, reserving particular styles for particular purposes - Ranjana writing for religious purposes, Prachalit for everyday writing, Bhujimmola for administrative purposes, and so on. Similar practices of reserving particular styles of writing for particular purposes have also occurred in European writing using the Roman system. But whatever style they are writing in, the Newars are still writing in their own Newar language (which they call Nepal Bhasha). In 2010 Michael Everson set out to encode Ranjana, with visual similarities persuading him that this was a writing system used in many parts of the Himalayas for writing Buddhist texts, though he focused on Nepal and the Newars, who in turn petitioned him to consider Prachalit as being much more useful. In the end Everson concluded that Ranjana and Prachalit and the many other styles should be unified. And so they should be, for they are used to write the same language. But there are some inconsistencies between the different scripts, and one account by Rabison Shakya seems to give Prachalit several more basic characters than Ranjana and Bhujimmola; how do we resolve this? Of course, we must refer to the language, and in this case the phonetic inventory, which shows that Shakya's account of Prachalit is correct: there are aspirated nasals not identified in his tables for Ranjana and Bhujimmola. A question not asked by Everson was whether the Newar writing should be unified in Unicode with Devanagari, just as the writing of Sindhi has been, but that would be politically unacceptable.
Bizarrely, a proposal to encode Prachalit has recently been posted by Anshuman Pandey (ISO/IEC JTC1/SC2/WG2 N4038), who seems to be following the same confusion as Everson, focussing on visual similarities rather than linguistic use.

The title of this thread is "language versus writing". For antiquarian writing, where all you have are the written records, focussing on visual appearance is all you can do. But for the writing of living languages, surely it must be the living human users who constitute the definitive point of reference! Shouldn't it be?


 Post subject: Re: languages versus writing
PostPosted: Sat Aug 06, 2011 1:49 pm 
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
Writing and language are not the same thing. While writing is ultimately motivated by rendering language in a visible form, it always follows its own conventions and rules. Once developed, a writing system will evolve separately from the language. Separately, as in "experiences changes at a different time and at a different rate than the underlying language" - not as in "totally unrelated". Of course, changes in the language will feed back to the writing system, and sometimes even the opposite can be observed.

If you merely seek to record the phonetic repertoire of an unwritten language, then use one of the notational systems for that purpose - all of them are (or can be) supported by Unicode. They constitute writing systems in their own right.
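As a small illustration of the point above, the Python snippet below takes a transcription in the International Phonetic Alphabet (one such notational system) and lists the Unicode characters it is made of. The three-sound sequence is invented for the example.

```python
import unicodedata

# An invented transcription: glottal stop, a, velar nasal.
transcription = "\u0294a\u014B"

# Each IPA symbol is an ordinary encoded character with its own code point and name.
listing = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in transcription]
for code_point, name in listing:
    print(code_point, name)
```

Because the symbols are ordinary characters, such a transcription can be typed, stored, searched and displayed with existing Unicode-based tools.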

Encoding a writing system in Unicode means that its constituent elements are discovered, classified and given a number. There's no a-priori best way to do this analysis, and, in principle, more than one breakdown into elements could exist. As long as it is possible to support the writing system (input, sorting, searching, display) on computers with a given set of elements, that particular set of elements would constitute a successful encoding. It may not be the only possible one, nor satisfy any particular preference, but it would be workable and usable.

Sometimes, it's possible to compare several proposals for an encoding and conclude by careful analysis that one would be preferable over another because it makes certain tasks easier than others. More often, there's a tradeoff, because different tasks become harder or simpler depending on which elements are chosen for the set.

At some point, "good enough" is just that, and is better than "slightly more perfect, but not yet available".

In other words, not only is Unicode firmly focused on writing (as opposed to somehow capturing "spoken language") it is also very pragmatic. The encoding is only as good as the computer algorithms it allows to be implemented. It is the latter that users encounter when they try to enter, search, sort and display data in their language.

If a particular encoding can be shown to be unimplementable or to present tough obstacles to (linguistically) correct implementations, then it would be rejected in the proposal stage, or amended by encoding additional characters if the defects were to be found later.

If you have a problem with a particular proposal, then you should present a detailed analysis and critique of its shortcomings (measured against the usability criteria I suggested above).

Finally, the process of unification does not mean that languages, or their writing systems, are unified, merely that it is possible to support several languages using the same set of encoded elements. Some languages (or orthographies) may require additional elements, but all may use a common set. This is something that happens frequently, as writing systems are developed on borrowed templates more often than de novo.


 Post subject: Re: languages versus writing
PostPosted: Mon Aug 08, 2011 3:39 pm 

Joined: Wed Jun 29, 2011 3:32 am
Posts: 8
Let me cite one of the gurus of writing, Roy Harris. In his book, “The Origins of Writing” (Duckworth 1986) he summarizes the generally held view that “Writing set human communication free from the limitations imposed by the impermanence of speech, and dispensed with the live presence of a speaker” (page 24). These he sees as consequences of writing, but not the original reason for the invention of writing; “From a technical point of view, writing is an extension of drawing, or more generally of graphic art” (p26) which he then spends the rest of the book giving supporting arguments for. The point to note here is that it is human communication that writing enables, necessarily expressed in a particular language.

This forum is about computing and information technologies, and their use in human communication. Harris writes that “writing had been the world’s most advanced communication technology from the fourth millennium BC down to the fifteenth century” (p24) with later inventions recognised by him being printing, and then telegraphy and telephony. Now computing and information technology embraces and eclipses all of these, but this takes us beyond Harris’ writings, and this is what I had wanted to work through in this forum.

This thread should really have been titled “languages versus encodings” to emphasize the computing connection, but “writing” helps bring out the focus of Unicode exclusively on graphic communication, so well brought out in the previous post by Asmus.

It does seem that the Unicode Standard, now in version 6.0.0, is irremediably focused on writing as sequences of graphical characters conforming to the rules of a writing system; the rules are very thoroughly described in Unicode 6.0. A sequence of characters might represent a human communication expressed in a particular language, but the particular language is viewed by Unicode as secondary: “The requirement for language information embedded in plain text data is often overstated.” (p152). If knowledge of the language is really necessary, which the standard concedes might be necessary for spell-checking or hyphenation or collation, but surprisingly not for text-to-speech generation, then a language tag can be used; language tags are in effect escape sequences for languages.

But I am concerned with computers supporting communication in living human languages, whether the communication is in writing or in speech. Computing technology is the current revolution in human communication, but regrettably it seems that Unicode is not part of that, though the OpenType rendering technology that grew up alongside Unicode surely is.

This is my last posting to this forum thread, though of course I will read any responses.

