Unicode corpus tools/missing characters

From: PAUL BAKER (j.p.baker@lancaster.ac.uk)
Date: Mon May 24 1999 - 06:32:41 EDT


Hello (I'm a recently subscribed addition to the mailing list).

Can anyone advise me on the availability of commercial corpus tools
(concordance/collocation etc) which are able to handle Unicode
characters?

Also, I am in the process of converting a small Punjabi corpus from an 8-bit
Indian font into the Gurmukhi Unicode characters (using UniEdit 1.4 from Duke
University). However, I am facing a few problems:

1) some of the diacritic characters in the font don't exist in the
Unicode Standard, in particular the pehri haha and pehri rara. I'd also
like to be able to input a bindi with a horizontal joining line.
2) some of the diacritics that do exist in Unicode aren't represented
well by UniEdit (notably the bindi, U+0A02, and UU, U+0A42).
3) some of the missing diacritics appear in private use slots in
UniEdit.
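
For illustration only, the kind of conversion described above might be
sketched as a lookup table from font byte values to Gurmukhi code points,
with anything unmapped pushed into the Private Use Area. The byte values
below are hypothetical (the actual layout of the 8-bit font is not given
here), and the names FONT_TO_UNICODE, PUA_BASE and convert are made up for
this sketch; it is not how UniEdit works internally.

```python
# Hypothetical font layout: these byte values are assumptions for illustration.
FONT_TO_UNICODE = {
    0x41: "\u0A05",  # GURMUKHI LETTER A
    0x48: "\u0A39",  # GURMUKHI LETTER HA
    0x4D: "\u0A02",  # GURMUKHI SIGN BINDI
    0x55: "\u0A42",  # GURMUKHI VOWEL SIGN UU
}

# Start of the Basic Multilingual Plane Private Use Area (U+E000..U+F8FF).
PUA_BASE = 0xE000


def convert(data: bytes) -> str:
    """Map each font byte to a Unicode character.

    Bytes with no entry in the table are placed at PUA_BASE + byte so that
    they stay distinguishable and can be remapped later.
    """
    out = []
    for b in data:
        if b in FONT_TO_UNICODE:
            out.append(FONT_TO_UNICODE[b])
        else:
            out.append(chr(PUA_BASE + b))
    return "".join(out)
```

Keeping the unmapped glyphs in one predictable private use range, rather
than substituting a "best fit" character, means the original distinction is
not lost if a proper encoding is settled on later.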

I was wondering if anyone else had come up against limitations either
in the Unicode editor they were using or in the Unicode Standard
itself - and if so, how they dealt with them. I've tried using a "best
fit" solution by employing other characters, which is not ideal. I'm
wondering if I should invest in another editor. And what would happen if
I tried to open a UniEdit .uni file in another Unicode editor? Would it
open at all? How would it handle the private use characters?
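
One way to see how much of a text depends on private use characters,
before trying it in another editor, is to list them. The sketch below is
only a rough check: the function names are made up for this example, and
reading the .uni file as UTF-16 is an assumption, since the file format is
not documented here.

```python
import unicodedata


def find_private_use(text):
    """Return (position, code point) pairs for private use characters.

    Private use characters have the Unicode general category "Co".
    """
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"
    ]


def find_private_use_in_file(path, encoding="utf-16"):
    # Assumption: the file is plain UTF-16 text; adjust the encoding if not.
    with open(path, encoding=encoding) as f:
        return find_private_use(f.read())
```

Any positions reported are the characters whose appearance will depend
entirely on the font and editor used, since the Unicode Standard assigns
them no meaning.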

I apologise if these are very naive questions or have already been dealt
with.

Paul Baker
Minority Languages Engineering Project
Lancaster University
UK.
http://www.ling.lancs.ac.uk/monkey/ihe/mille/public/title.htm


