Stefan Persson wrote as follows (text >), responding to Andrew C. West (text >>).
>> Personally I think that markup may be more appropriate, given the
>> countless possible permutations of combining/superscript letters that
>> may be encountered in mediaeval texts in various languages.
> Why not just add *two* characters, either to the PUA or to Unicode?
> U+XXXX = COMBINING LETTER ABOVE INDICATOR
> U+XXXY = SUPERSCRIPT LETTER INDICATOR
> This means that U+XXXX directly followed by "a" is a combining "a" above,
> and that U+XXXY directly followed by "a" is a superscript "a".
> This means some normalisation issues:
> U+0061 U+0363 ≡ U+0061 U+XXXX U+0061
> U+00AA ≡ U+XXXY U+0061
Well, such normalisation could be as private a matter as the allocation of
the two characters to the Private Use Area. Please consider the following
scenario, which I have devised as a piece of fiction, yet which does not
seem unrealistic as something that might happen in practice, somewhere,
sometime. Suppose that someone wishes to transcribe the text of a medieval
manuscript so as to have the text stored in a computerised format. Upon
finding characters in the manuscript which cannot be entered as Unicode
characters, he or she might reasonably devise his or her own encoding list,
say by making a handwritten list (with a view to later putting the piece of
paper through a scanner to produce a graphic file), and then use that
encoding list to decide which characters to key into the computer system,
perhaps doing the keying with a program such as UniPad.
The UniPad website is as follows.
It may be that the UniPad program could be customised so as to have a
special soft keyboard to help the transcriber in keying the codes. Yet even
if that is not possible, the Private Use Area codes could be entered using
the character map which UniPad provides.
In such circumstances the transcriber could decide on a Private Use Area
encoding which has one Private Use Area code point for each such character
in the manuscript, or could decide on a system which uses the two operators
which you suggest, together with zero or more other operators and zero or
more individual characters, depending upon the repertoire of characters
which exists in the manuscript.
Certainly there are then issues of using the data once it is in a computer
file. Maybe some special program will need to be written (such as a small
Pascal program; I do not mean some major development project, just something
which will do what is required for the particular transcription project).
Yet for someone to use two such Private Use Area encodings in order to get
the information content accurately from the document into the computer
seems a perfectly reasonable thing to do. The transcriber might need to do
the transcribing of the original document during certain daytime hours at a
table in a secure library environment, during a time frame arranged by prior
appointment and permissions. Once the transcribed data is in the computer,
either keyed in while in the library or transcribed from notes made using a
pencil, the transcriber and other interested people throughout the world
can analyse the meaning of the text of the document almost anywhere.
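The kind of small program mentioned above might look something like the
following Python sketch (Python rather than Pascal, purely for brevity). It
is only an illustration under stated assumptions: U+E000 and U+E001 are
hypothetical stand-ins for the undecided U+XXXX and U+XXXY, the name
normalise_private is invented here, and the tables list only letters for
which regular Unicode already has a combining or superscript form. It
applies the two quoted equivalences as a private normalisation and leaves
every other indicator sequence untouched.

```python
# Hypothetical Private Use Area assignments, chosen only for this sketch;
# the posting leaves the values of XXXX and XXXY undecided.
COMBINING_ABOVE_INDICATOR = "\uE000"  # stands in for U+XXXX
SUPERSCRIPT_INDICATOR = "\uE001"      # stands in for U+XXXY

# Letters which already have a combining form above in regular Unicode
# (U+0363 to U+036F).
COMBINING_FORMS = {
    "a": "\u0363", "e": "\u0364", "i": "\u0365", "o": "\u0366",
    "u": "\u0367", "c": "\u0368", "d": "\u0369", "h": "\u036A",
    "m": "\u036B", "r": "\u036C", "t": "\u036D", "v": "\u036E",
    "x": "\u036F",
}

# Letters which already have a superscript form; only the two Latin-1
# ordinal indicators are listed here, to keep the sketch short.
SUPERSCRIPT_FORMS = {"a": "\u00AA", "o": "\u00BA"}


def normalise_private(text):
    """Replace indicator + letter by a regular character where one exists."""
    tables = {
        COMBINING_ABOVE_INDICATOR: COMBINING_FORMS,
        SUPERSCRIPT_INDICATOR: SUPERSCRIPT_FORMS,
    }
    out = []
    i = 0
    while i < len(text):
        table = tables.get(text[i])
        # Only rewrite when an indicator is directly followed by a letter
        # that has a regular Unicode equivalent in the matching table.
        if table is not None and i + 1 < len(text) and text[i + 1] in table:
            out.append(table[text[i + 1]])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

For example, normalise_private("a\uE000a") gives U+0061 U+0363, whereas the
combining indicator followed by "z", which has no regular combining form, is
passed through unchanged for the group's own software to interpret.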
For people trying to understand such documents, a convenient way of having a
look at the transcribed text might be to use the two codes within the
Private Use Area together with an ordinary TrueType font. In that font,
U+XXXX would show a glyph of an arrow which starts by going straight upwards
and then goes steeply diagonally upwards in a bend dexter direction until it
reaches the point of the arrow (as if the back half of the arrow were as in
U+2191 and the front half were as in U+2196), and U+XXXY would show an arrow
going straight upwards until it reaches the point of the arrow (similar to
U+2191). I only suggest those particular glyphs as examples in this posting;
please feel free to use whatever glyph designs you wish.
Certainly, such Private Use Area codes would only have validity amongst a
group of users of the Unicode system who had agreed to use those particular
Private Use Area encodings with those meanings. Yet such an encoding could,
I feel, be very useful amongst such a group of researchers, in that it would
get the document transcription job done. It would also have a considerable
advantage when the transcribed file is displayed in a program such as
WordPad or Word. In order to understand an indication that, in the original
document, any regular Unicode character is combined above any other regular
Unicode character, or that any regular Unicode character is superscripted,
one would only need a Unicode font augmented with two arrow glyphs at the
appropriate code points.
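Where a reader has no such augmented font to hand, a rough stopgap for
viewing a file could be sketched as below. This is an assumption-laden
illustration rather than part of the proposal: U+E000 and U+E001 are
hypothetical stand-ins for U+XXXX and U+XXXY, show_with_arrows is a name
invented here, and the existing arrows U+2196 and U+2191 are swapped in for
the suggested glyphs (U+2196 only approximates the bent arrow described for
U+XXXX).

```python
# Hypothetical Private Use Area assignments, for illustration only.
COMBINING_ABOVE_INDICATOR = "\uE000"  # stands in for U+XXXX
SUPERSCRIPT_INDICATOR = "\uE001"      # stands in for U+XXXY

# Existing arrow characters used as visible substitutes for the indicators.
ARROW_FALLBACK = {
    ord(COMBINING_ABOVE_INDICATOR): "\u2196",  # NORTH WEST ARROW (approximation)
    ord(SUPERSCRIPT_INDICATOR): "\u2191",      # UPWARDS ARROW
}


def show_with_arrows(text):
    """Return a copy of the text with each indicator shown as an arrow."""
    return text.translate(ARROW_FALLBACK)
```

The result is meant for reading only; the stored transcription would keep
the Private Use Area codes so that the agreed meanings are not lost.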
Well, why not go ahead and decide on two code points within the Private Use
Area as values for XXXX and XXXY, and post them in this list? Perhaps that
action will lead to that facility becoming available to document
transcribers all around the world.
If the code points were published in this manner, maybe a font and maybe a
UniPad soft keypad which use those code points would become available in
time, and so researchers transcribing documents in libraries around the
world would have a lasting enhancement of the facilities available to them.
This method would not produce a visually correct display, yet in order to
convey meaning in a research environment, this method could help in getting
the transcribing done and thus would be a valuable addition to the
9 August 2002