Re: Unicode corpus tools/missing characters

Date: Mon May 24 1999 - 10:50:43 EDT


       We in SIL have overlapping interests with you since we are
       heavily involved in supporting linguistic research in a large
       number of minority languages (currently over 1000) around the
       world. We have a variety of software tools for doing linguistic
       research, some of which may perform the tasks that you need.

       We are in the process of a major re-engineering of our tools.
       Unfortunately, we do not have Unicode-capable versions available
       at present. Unicode support is being added, however, as is the capability
       to define the writing-system-specific needs of each minority
       language for rendering, collation, etc. So, I can't help you
       today, but in the future our software may be of use to you.

       For further information on our language software tools, feel
       free to visit our web site at


       From: AT internet on 05/24/99 05:32

       Received on: 05/24/99

       To: Peter Constable/IntlAdmin/WCT, AT
       Subject: Unicode corpus tools/missing characters

       Hello (I'm a recently subscribed addition to the mailing list).

       Can anyone advise me on the availability of commercial corpus
       tools (concordance, collocation, etc.) which are able to handle
       Unicode characters?

       Also, I am in the process of converting a small Punjabi corpus
       from an 8-bit Indian font into the Gurmukhi Unicode characters
       (using UniEdit 1.4 from Duke University). However, I am facing
       a few problems:

       1) some of the diacritic characters in the font don't exist in
       the Unicode Standard, in particular the pehri haha and pehri
       rara. I'd also like to be able to input a bindi with a
       horizontal joining line.
       2) some of the diacritics that do exist in Unicode aren't
       handled well by UniEdit (notably the bindi 0A02 and UU 0A42)
       3) some of the missing diacritics appear in private use slots
       in UniEdit.
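
       As a rough way to check what a given code point is, the
       sketch below looks up the two code points named in item 2 and
       one private use code point in Python's standard unicodedata
       module. The helper name describe and the choice of U+E000 as
       the private use example are assumptions made for illustration;
       nothing here reflects how UniEdit itself assigns its private
       use slots.

```python
import unicodedata

def describe(code_point):
    """Return (name, general category) for a code point.

    Private use code points have no character name, so a placeholder
    string is returned instead of raising ValueError.
    """
    ch = chr(code_point)
    name = unicodedata.name(ch, "<no name>")
    return name, unicodedata.category(ch)

# 0A02 and 0A42 are the code points mentioned above; E000 is an
# arbitrary private use code point chosen only as an example.
for cp in (0x0A02, 0x0A42, 0xE000):
    name, category = describe(cp)
    print(f"U+{cp:04X}  {category}  {name}")
```

       A general category of "Co" marks a private use code point,
       whose meaning depends entirely on the tool or font that
       assigned it.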

       I was wondering if anyone else had come up against limitations,
       either in the Unicode editor they were using or in the
       Unicode Standard itself, and if so, how they dealt with
       them. I've tried using a "best fit" solution by employing other
       characters, which is not ideal. I'm wondering if I should
       invest in another editor. And what would happen if I tried to
       open a UniEdit .uni file in another Unicode editor? Would it
       open at all? How would it handle the private use characters?
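
       One way to see how much of a text depends on private use code
       points before moving it to another editor is to list them. The
       following is a minimal sketch, assuming the text has already
       been decoded into a Python string; the function name
       find_private_use and the sample string are hypothetical, and
       the on-disk encoding of a UniEdit .uni file is not assumed
       here.

```python
import unicodedata

def find_private_use(text):
    """Return (index, 'U+XXXX') for every private use character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # "Co" = private use
    ]

# Hypothetical sample: a Gurmukhi letter, one private use character,
# and the bindi sign.
sample = "\u0A38\uE000\u0A02"
for index, label in find_private_use(sample):
    print(f"private use character {label} at position {index}")
```

       Another editor would most likely show such characters with
       whatever glyph, if any, its own font assigns to those code
       points, so a list like this indicates what would need
       remapping.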

       I apologise if these are very naive questions or have already
       been dealt with.

       Paul Baker
       Minority Languages Engineering Project
       Lancaster University

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:46 EDT