Re: issues storing ZWSP in docs, files and databases

From: Ngwe Tun (ngwestar@gmail.com)
Date: Tue Aug 28 2007 - 00:21:55 CDT

    Dear Dr Virach and Friends

    Thanks for your replies;

    Uniform word breaking for Southeast Asian scripts should be approached
    the way spell-checking utilities are. We have faced this terrible problem
    in our own language(s).

    Please see the following links for a similar concept presented by Ed
    Trager:
    1) Page 7 of URL http://www.unifont.org/textlayout/TheBigPicture.pdf
    2) Page 38 of URL
    http://www.unifont.org/TextLayout2007/presentations/DAM4_June_2007_Text_Services_Presentation.pdf

    If we try this as a collaborative effort, it could become a big picture
    for us. The concept is very similar to spell checking, rendering engines,
    OCR and input methods: all of that software has a language-dependent
    section. So if we create uniform word breaking for Southeast Asia, it
    will have a language-dependent section too. I understand that ICU has
    that kind of model. I'm sure that Burmese is not included in it.
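
    To illustrate that kind of model, here is a minimal Python sketch of one
    uniform entry point with a language-dependent section behind it. This is
    not ICU's API. The names register and break_words, the registry and the
    "x-demo" language tag are all hypothetical, and the whitespace splitter
    only stands in for a real per-language segmenter.

```python
from typing import Callable, Dict, List

# A segmenter takes a string and returns the list of words it found.
Segmenter = Callable[[str], List[str]]

# Hypothetical registry: language tag -> language-dependent segmenter.
_SEGMENTERS: Dict[str, Segmenter] = {}


def register(lang: str, segmenter: Segmenter) -> None:
    """Plug in the language-dependent part for one language."""
    _SEGMENTERS[lang] = segmenter


def break_words(text: str, lang: str) -> List[str]:
    """Uniform entry point: dispatch to whichever segmenter is registered."""
    try:
        segmenter = _SEGMENTERS[lang]
    except KeyError:
        raise ValueError(f"no word breaker registered for {lang!r}") from None
    return segmenter(text)


# Placeholder only: splits on whitespace. A real Thai, Khmer or Burmese
# segmenter would be registered under its own tag instead.
register("x-demo", str.split)
```

    Each team would then contribute only its own segmenter, and callers
    would keep using the same break_words call for every language.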

    As Zaw's mail said, we in the Myanmar NLP Team are preparing lexical
    resources. We are looking for a way to break words with a
    dictionary-based method and/or a rule-based method.
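
    As a rough illustration of the dictionary-based method, here is a
    minimal Python sketch of greedy longest-match segmentation that joins
    the resulting words with ZWSP (U+200B). It is only a sketch under stated
    assumptions, not our team's implementation: the function names are
    hypothetical, and the tiny Latin-letter lexicon is a toy stand-in for
    real lexical resources.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def segment(text, lexicon):
    """Greedy longest-match segmentation against a set of known words.

    Characters that start no lexicon entry become one-character tokens.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until one matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


def insert_zwsp(text, lexicon):
    """Return text with a ZWSP between every pair of segmented words."""
    return ZWSP.join(segment(text, lexicon))


def strip_zwsp(text):
    """Remove every ZWSP, recovering the unsegmented text."""
    return text.replace(ZWSP, "")


# Toy lexicon, for illustration only.
lexicon = {"word", "break", "breaking"}
print(segment("wordbreaking", lexicon))  # ['word', 'breaking']
```

    Greedy longest match is only the simplest dictionary method. It picks
    the wrong break whenever a shorter word was intended, so a real system
    would likely combine it with rules or statistics.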

    Shall we join forces on the word breaking issue: Mr. Javier for Khmer,
    Dr. Virach for Thai, Ed Trager for Thai and other languages, and the
    Myanmar NLP Team for Burmese? Other parties are welcome to join our
    collaborative effort. That would be a great effort. We will each provide
    the nature of our language and our lexical resources, and build a
    uniform model. :) The dream may come true and be useful for Southeast
    Asia.

    Please feel free to write your comments.

    Regards

    Ngwe Tun

    On 8/28/07, Virach Sornlertlamvanich <virach@tcllab.org> wrote:
    >
    > Hi Javier,
    >
    > It would be great if the policy works for Khmer. It amounts to a
    > writing revolution for the language. ZWSP will help disambiguate the
    > concept of a word unit. I am curious how people can become familiar
    > with word breaking and apply it uniformly. An invisible break will also
    > be a barrier to unifying the concept. Will you run a survey on how
    > uniformly they break words?
    >
    > Virach
    >
    > Javier SOLA wrote:
    > > For Khmer we have chosen to go directly for ZWSP. All those learning
    > > to type Khmer Unicode (including the public school system) are taught
    > > to introduce the ZWSP between words. This forces them to recognize
    > > what is a word and what is not (an effort that they did not have to do
    > > before, but people get used to it quite quickly).
    > >
    > > It will still be quite a number of years until all applications
    > > support line breaking for Khmer, so this is the only available
    > > solution for Khmer.
    > >
    > > ZWSP also helps with searching, as it separates words.
    > >
    > > We have a small Java application that inserts ZWSPs into a text,
    > > using a dictionary. This is especially necessary when we convert
    > > texts in legacy encodings to Unicode and then have to insert the ZWSP.
    > >
    > > Regards,
    > >
    > > Javier
    > >
    > > Virach Sornlertlamvanich wrote
    > >> Hi Ngwe Tun,
    > >>
    > >> We leave the Thai text as it is because:
    > >> 1. It allows text to be input as it is seen.
    > >> 2. It is hard to find a good consensus on word breaking, and on
    > >> sentence breaking as well. A range of acceptable boundaries exists
    > >> for both words and sentences.
    > >>
    > >> Even if a word break marker is introduced, it is hard to obtain a
    > >> result text that everyone agrees on. It would also burden the author
    > >> with inserting boundary markers.
    > >>
    > >> Our current solution is a segmentation operation, which can be
    > >> implemented with a rule-based, dictionary-based or hybrid approach.
    > >> Although it is not 100% accurate, it is better than doing it
    > >> manually.
    > >>
    > >> However, we still have a problem handling line ends in HTML text.
    > >> There is no good evidence for whether or not to put a space
    > >> character at the line end when layout adjustment is needed.
    > >>
    > >> Virach
    > >>
    > >> Ngwe Tun wrote:
    > >>> Hi Friends
    > >>>
    > >>> Because of the complex nature of our languages, Burmese, Khmer, Lao
    > >>> and Myanmar have no word break marker in their sentences. So we
    > >>> have to work hard to identify words and insert a marker, either by
    > >>> an operator or automatically. It is really hard work. Some languages
    > >>> also need a sentence-ending marker to be identified.
    > >>>
    > >>> Then I found a good discussion at
    > >>> http://blogamundo.net/dev/2006/12/28/the-zero-width-space/
    > >>> concerning ZWSP.
    > >>>
    > >>> I have doubts about adding ZWSP to Burmese text that is stored in
    > >>> documents, files and databases.
    > >>>
    > >>> In Burmese, characters combine into a syllable, syllables combine
    > >>> into a word, and words combine into a sentence. We only have a
    > >>> sentence-ending marker. We can identify syllables with a rule-based
    > >>> solution, so syllabification does not need much complexity. But
    > >>> word identification takes a lot of work. I expect that we will need
    > >>> good lexical resources. (Please correct me if I'm wrong.)
    > >>>
    > >>> So after a syllable breaking algorithm, we have text segmented by
    > >>> syllable. First, I would like to know which syllable break marker
    > >>> is used in your language (Thai, Khmer, Lao or any other language).
    > >>> Second, should it be added by an operator, or automatically by a
    > >>> shaping engine (Uniscribe, Pango, Qt, ATT or some other program) or
    > >>> by an input method (keyboard, IME or on-screen keyboard)?
    > >>>
    > >>> ZWSP is defined in Unicode 5.0, Section 16.2, page 535:
    > >>>
    > >>> Zero Width Space. The U+200B zero width space indicates a word
    > >>> boundary, except that it has no width. Zero-width space characters
    > >>> are intended to be used in languages that have no visible word
    > >>> spacing to represent word breaks, such as Thai, Khmer, and Japanese.
    > >>> When text is justified, ZWSP has no effect on letter spacing—for
    > >>> example, in English or Japanese usage.
    > >>>
    > >>> We have to use ZWSP for word breaking in our language, so we need
    > >>> to use it for line breaking too. Every Burmese word might be
    > >>> followed by a ZWSP, added either automatically or by an operator.
    > >>>
    > >>> Please let me ask for one last clarification. Do we need to store
    > >>> ZWSP in documents, files and databases for the purpose of word
    > >>> segmentation/breaking? Or is it possible to add it automatically in
    > >>> some other way?
    > >>>
    > >>> Please share your experiences with word breaking and ZWSP issues in
    > >>> your language.
    > >>>
    > >>> Thanks in advance.
    > >>>
    > >>> Ngwe Tun.
    > >>>
    > >>>
    > >>> --
    > >>> In Burmese; Ngwe mean 1) Silver 2) Money 3) Second Awards; Tun mean
    > >>> 1) Light 2) be prominent.

    -- 
    In Burmese; Ngwe mean 1) Silver 2) Money 3) Second Awards; Tun mean 1) Light
    2) be prominent.
    

