The Unicode Consortium Discussion Forum


All times are UTC - 6 hours [ DST ]




 Post subject: Computing with the Lao language: syllabification and related
PostPosted: Sat Sep 24, 2011 3:41 pm 

Joined: Fri Sep 23, 2011 12:56 pm
Posts: 7
Location: Niort, France
// Message to the CLDR list, reforwarded here for interested people
// Any hint about this issue would be welcome.

I was recently asked to help with computing for the Lao
language, and in particular to find more accurate sources on
Lao syllabification and on the problems that arise when processing
text in this language.

Mozilla currently has an RBBI implementation based on an article
published on the web. The same approach was also initiated and documented
in work by the Lao government's National Authority for Science and
Technology (NAST). That work produced a draft document, published
online, but in a very old HTML format whose layout is
difficult to exploit. In addition, the hosting website has many
technical problems (lots of timeouts during HTTP requests).

The person who asked me for help could not download
the page I pointed him to, so I decided to provide a backup copy that I
had of the page, reformatted as a Word DOCX document.

I am not the author of the document, and it contains many uncorrected
English typos, but it may still help interested people.
I have not been able to contact the NAST lab that published
it.

I am now asking whether there have been any updates to this work, which
was initiated and published as a draft in 2008-2009.

Is there any ongoing work on a correct syllabification for Lao?

Here is the link to the reformatted document, in Word DOCX format:

https://docs.google.com/leaf?id=1HbpXI7 ... NfKm-3uXPw

Or in PDF format:

https://docs.google.com/viewer?pid=expl ... JmODMyMmRm

(Note: as I am not the author, I do not own the copyright. I will
remove this republication immediately if NAST or the authors of the
document ask me to by private email. The document itself had
no defined title; I just made a generic title from its subject.)


 
 Post subject: Re: Computing with the Lao language: syllabification and rel
PostPosted: Sat Sep 24, 2011 9:54 pm 

Joined: Fri Sep 23, 2011 12:56 pm
Posts: 7
Location: Niort, France
I also found the following implementation for the Lucene project, based on ICU in Java:

http://svn.apache.org/repos/asf/lucene/ ... rator.java

using this rule set in ICU's RBBI format:

http://svn.apache.org/repos/asf/lucene/ ... 9/Lao.rbbi

Note that the two algorithms (the RBBI rules and the Java class performing the rollback) could be merged into a single LALR grammar with just one letter of look-ahead, plus rollback support. This is possible with Yacc/Bison or ANTLR, but it is certainly also possible to combine them in a more predictive, more optimized way that resolves in advance the cases for which the rollback is certain to fail. One thing is not implemented: sequences <X9,X10> should remain unbreakable in all cases, so the implementation should never roll back one character in a way that splits such a sequence. Similar issues should be looked for by carefully analyzing the RBBI rules.
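
To make the idea concrete, here is a minimal sketch in Python of greedy matching with a one-character rollback and a guard for unbreakable pairs. It is only a toy under loud assumptions: the "grammar" is a made-up consonant-vowel pattern on Latin placeholder letters, not the Lao rules from Lao.rbbi, the pair in UNBREAKABLE is a hypothetical stand-in for a <X9,X10> sequence, and none of this is the Lucene code.

```python
import re

# Hypothetical toy grammar on Latin placeholders, NOT real Lao rules:
# a syllable is consonant + vowel + optional final consonant.
CONSONANTS = "bdgkmnst"
VOWELS = "aeiou"
SYLLABLE = re.compile(f"[{CONSONANTS}][{VOWELS}][{CONSONANTS}]?")

# Hypothetical stand-in for a <X9,X10> sequence that must never be split.
UNBREAKABLE = {("o", "m")}


def starts_syllable(text, pos):
    """True if the end of text, or a valid syllable, begins at pos."""
    return pos == len(text) or SYLLABLE.match(text, pos) is not None


def next_break(text, start):
    """Return the end offset of the segment that begins at start."""
    m = SYLLABLE.match(text, start)
    if m is None:
        return start + 1  # lone character: an invalid "syllable"
    end = m.end()  # greedy match
    if starts_syllable(text, end):
        return end
    back = end - 1  # roll back exactly one character
    splits_pair = (text[back - 1], text[back]) in UNBREAKABLE
    if (
        not splits_pair
        and SYLLABLE.fullmatch(text, start, back) is not None
        and starts_syllable(text, back)
    ):
        return back
    return end  # rollback did not help: keep the greedy syllable


def syllables(text):
    """Split text into segments by repeatedly calling next_break."""
    out, pos = [], 0
    while pos < len(text):
        end = next_break(text, pos)
        out.append(text[pos:end])
        pos = end
    return out


print(syllables("bakadu"))  # ['ba', 'ka', 'du']  (rollback succeeds twice)
print(syllables("bakx"))    # ['bak', 'x']        (rollback fails, greedy kept)
print(syllables("tomato"))  # ['tom', 'a', 'to']  (guard refuses to split o+m)
```

The second call shows the situation where the rollback fails and a single invalid character is left over; the third shows that the guard keeps the pair together even when a rollback would otherwise give a cleaner split.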


Now I'd like to know what happens when the LaoBreakIterator, applying the RBBI rules greedily, determines that it must roll back one character, finds that the break condition for the next syllable is still not satisfied, and therefore decides to keep the long (greedy) first syllable, so that the remaining next "syllable" (containing only one Lao character) is invalid according to Lao syllable rules:
  • When does this case occur? With borrowed foreign words, more or less adapted to the Lao script? (Lao mostly uses monosyllabic "words" or morphemes that can be aggregated into compounds, which remain breakable between parts, each part having a possibly complex syllabic structure with tones. Borrowed foreign words, however, do not obey the traditional Lao syllabic rules, or may be multisyllabic, and Lao may avoid breaking them.)
  • Or because of a Lao orthographic typo that a spell corrector could try to fix by looking up loose matches in a dictionary (for example, finding occurrences of swapped letters, missing letters, or superfluous/doubled letters)?

The first item will be useful in web browsers for correct line breaking; the second will only be useful to text editors and spell correctors. Both cases, however, require a dictionary for looking up exceptions or for proposing corrections. But are such dictionaries available, other than in books or in old documents that are not encoded in Unicode but in one of the many forms of "LAO ASCII" used by various non-Unicode-mapped fonts?
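
As an illustration of the "loose match" lookup mentioned above, here is a small Python sketch that generates the candidates reachable by one swapped pair, one removed letter, or one restored letter, and keeps those found in a word list. The two-entry LEXICON and the function name loose_matches are assumptions made up for this example; a real spell corrector would need an actual Lao dictionary, which is exactly the open question. It also works on raw code points and ignores any Lao-specific ordering constraints.

```python
# Tiny hypothetical word list, written with escapes to keep code points exact.
LAO = "\u0ea5\u0eb2\u0ea7"             # LO LOOT, VOWEL SIGN AA, WO
LANGUAGE = "\u0e9e\u0eb2\u0eaa\u0eb2"  # PHO TAM, AA, SO SUNG, AA
LEXICON = {LAO, LANGUAGE}


def loose_matches(word, lexicon):
    """Return lexicon entries exactly one simple edit away from word."""
    # Letters available for re-insertion are taken from the lexicon itself.
    alphabet = sorted({ch for entry in lexicon for ch in entry})
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Two adjacent letters typed in the wrong order.
    swapped = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    # One superfluous (for example doubled) letter removed.
    dropped = {a + b[1:] for a, b in splits if b}
    # One missing letter restored.
    added = {a + ch + b for a, b in splits for ch in alphabet}
    candidates = swapped | dropped | added
    return sorted((candidates & set(lexicon)) - {word})


swapped_typo = LAO[0] + LAO[2] + LAO[1]       # last two letters swapped
doubled_typo = LAO[:2] + LAO[1] + LAO[2]      # vowel sign typed twice
missing_typo = LANGUAGE[0] + LANGUAGE[2:]     # first vowel sign missing

print(loose_matches(swapped_typo, LEXICON) == [LAO])       # True
print(loose_matches(doubled_typo, LEXICON) == [LAO])       # True
print(loose_matches(missing_typo, LEXICON) == [LANGUAGE])  # True
```

The candidate set grows with the word length and the alphabet size, so for a full dictionary it would probably be better to index the lexicon than to enumerate edits this way.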

