Re: a character for an unknown character

From: William_J_G Overington <wjgo_10009_at_btinternet.com>
Date: Mon, 26 Dec 2016 11:31:54 +0000 (GMT)

Jukka K. Korpela wrote:

> So I think this does not fall into the category of plain text, and the information should be expressed at a higher protocol level, e.g. in markup or as out-of-band information.

I opine that requiring a higher level protocol makes encoding a document more complicated than it need be. Using plain text would allow the transcript to be placed directly into a Portable Document Format (PDF) document when it is published.

> Such things can hardly be described using new characters; ....

I opine that they can and that it would be straightforward to do that.

Certainly, the fact that it can be done does not necessarily mean that it will be done. It depends upon what people choose to do.

For example, I designed some glyphs that could be used for such characters.

http://www.unicode.org/mail-arch/unicode-ml/y2016-m12/0071.html

Other glyphs could be designed if more are needed. Not necessarily designed by me.

The example that is quoted, namely “there is letter U or letter V, probably the latter”, may need the U and the V to be placed between two of the new characters, yet that is not a great problem.

For example, one could use two of the designs attached to the http://www.unicode.org/mail-arch/unicode-ml/y2016-m12/0071.html post: a character based on the glyph in transcribe_ea65.png before the U and a character based on the glyph in transcribe_ea66.png after the V. That gives a four character sequence.
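As a minimal sketch of that four character sequence in Python: it assumes that the glyph file names transcribe_ea65.png and transcribe_ea66.png correspond to the Private Use Area code points U+EA65 and U+EA66, and the constant and function names are hypothetical, chosen only for illustration.

```python
# Assumption: the glyphs in transcribe_ea65.png and transcribe_ea66.png
# are mapped to U+EA65 and U+EA66 in the Private Use Area.
# The names below are hypothetical, not part of any standard.
UNCERTAIN_OPEN = "\uEA65"
UNCERTAIN_CLOSE = "\uEA66"


def mark_uncertain(candidates: str) -> str:
    """Place the candidate letters between the two delimiter characters."""
    return UNCERTAIN_OPEN + candidates + UNCERTAIN_CLOSE


sequence = mark_uncertain("UV")

# Show the sequence as a list of code points.
print([f"U+{ord(ch):04X}" for ch in sequence])
```

The result is plain text: it can be stored, searched and copied like any other string, and it displays with the intended glyphs wherever a font covering those two code points is installed.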

If desired, a character based on the glyph in transcribe_ea67.png could be placed before a note made by a transcriber and a character based on the glyph in transcribe_ea68.png could be placed after a note made by a transcriber. That could be helpful while transcribing a document. Maybe later the note could be moved to be a footnote in a book, yet the characters could be useful when actually transcribing.
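A sketch in Python of how such notes might later be separated from the transcript, for example to move them into footnotes. It assumes U+EA67 and U+EA68 as the opening and closing characters; the function name and the sample text are invented for illustration.

```python
import re

# Assumption: U+EA67 opens a transcriber's note and U+EA68 closes it.
NOTE_OPEN = "\uEA67"
NOTE_CLOSE = "\uEA68"

# Match an opening character, any run of non-closing characters, then a closing character.
NOTE_PATTERN = re.compile(NOTE_OPEN + "([^" + NOTE_CLOSE + "]*)" + NOTE_CLOSE)


def split_notes(transcript: str) -> tuple[str, list[str]]:
    """Return the transcript without notes, plus the list of note texts."""
    notes = NOTE_PATTERN.findall(transcript)
    body = NOTE_PATTERN.sub("", transcript)
    return body, notes


# Invented sample text.
sample = "The kyng" + NOTE_OPEN + "ink blot here" + NOTE_CLOSE + " rode forth."
body, notes = split_notes(sample)
print(body)
print(notes)
```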

> If some graphic symbol is by convention used to represent a lacuna, then the issue, as regards Unicode, is simply whether that symbol exists as an encoded character or whether there is need to add that graphic symbol to Unicode. But it would be a matter of encoding graphic characters (irrespective of their meaning in some content), not about encoding abstract ideas like “an unrecognized character”.

Well, I opine that the situation needs to be assessed based upon the needs of today, not based on rules made long ago before today's needs arose.

> Perhaps there should be a universal convention about this, but it is unrealistic to expect that to happen.

I opine that it is realistic for it to happen. For example, I have published in this list nine new glyphs. In this post I have suggested meanings for four of them. I have suggested some Private Use Area code points. There are open source fonts around; people can, if they so choose, copy one of them, give it a new font name and a new file name, and add nine new characters at U+EA60 through U+EA68 based on my designs in the above-mentioned post. If that is what people want to do, it could happen very quickly, maybe in a few days. The new open source font could then be made available on a web site, and a person interested in using the font could download it and install it on his or her own computer. Because the Private Use Area is part of Unicode, and because fonts on many computers are installed so that they do not depend upon which software program is using them, a Private Use Area encoding can be used very effectively.
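A quick check of that range, as a sketch in Python using only the standard unicodedata module. It confirms how many code points lie from U+EA60 through U+EA68 and whether each has the general category Co (private use). The variable names are illustrative only.

```python
import unicodedata

FIRST, LAST = 0xEA60, 0xEA68

# The range is inclusive at both ends.
code_points = range(FIRST, LAST + 1)
count = len(code_points)

# General category "Co" means "Other, private use".
all_private_use = all(
    unicodedata.category(chr(cp)) == "Co" for cp in code_points
)

print(count, all_private_use)
```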

I accept that that may not be what people want to do and that it may well not happen, yet it is realistic that it could happen if people want to do it.

If it does happen, there would then be a process to decide whether the characters are used sufficiently for them to be encoded into regular Unicode.

> The Unicode Standard can hardly standardize such things.

I opine that The Unicode Standard could standardize such things if the Unicode Technical Committee decides that it wants to do that.

> And if there were such a universal symbol, it would surely have been encoded in Unicode—not because of its meaning, but because of its consistent use as a character in plain text.

Well, there are new characters being encoded every year, some of which have not existed in plain text before. Progress happens when it happens; new ideas can arise and be applied today.

> You should not expect the character to be recognized in this special meaning without such a higher-level convention.

Well, I opine that if the character is a new character designed and defined as to meaning for the purpose then such an expectation would be reasonable.

> There’s a theoretical (?) problem with this. Let us assume that you decide to use a particular character to represent “unknown character” in your data, when working with some type of written texts. What happens when you encounter, in the study of those texts, a graphic symbol that is best identified as the character you decided to use in that special meaning? Well, I think you can decide to solve that problem if it ever appears.

An advantage of having new characters designed and defined as to meaning specifically for the purpose is that that should avoid such a problem arising - though one can never be absolutely sure about that.
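The collision that is described can be checked for mechanically before a marker character is adopted. A minimal sketch in Python, where placeholder_is_safe is a hypothetical helper and the sample strings are invented for illustration:

```python
def placeholder_is_safe(placeholder: str, source_texts: list[str]) -> bool:
    """Return True if the placeholder never occurs in the texts being transcribed."""
    return not any(placeholder in text for text in source_texts)


# Invented sample: one source text itself contains a black circle as a bullet.
sources = ["\u25cf Item the first", "Item the second"]

# U+25CF already occurs in the sources, so it would be ambiguous as a marker.
print(placeholder_is_safe("\u25cf", sources))

# A Private Use Area code point (assumed here to be U+EA65) does not occur.
print(placeholder_is_safe("\uEA65", sources))
```

Such a check only covers the texts examined so far, so it reduces the risk without removing it.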

William Overington

Monday 26 December 2016

----Original message----
From : jkorpela_at_cs.tut.fi
Date : 25/12/2016 - 17:31 (GMTST)
To : unicode_at_unicode.org
Subject : Re: a character for an unknown character

21.12.2016, 4:29, Martin Mueller wrote:

> Is there a Unicode character that says “I represent an alphanumerical
> character, but I don’t know which”.

I think including such a “character” in Unicode would not fit into the
idea of Unicode as a system for encoding plain text characters. You
seem to be asking for a symbol that is not a graphic or control
character but information about uncertainty regarding a character in a
data stream. So I think this does not fall into the category of plain text,
and the information should be expressed at a higher protocol level, e.g.
in markup or as out-of-band information.

When it is not certain what character there is in some text to be
encoded, there is a wide range of possible situations. For example, it
might be a thing like “there is letter U or letter V, probably the
latter” or “there is some Latin letter but no hint of what it might be”
or even “there is an alphanumerical character” (though I find it
difficult to imagine such a situation). Such things can hardly be
described using new characters; rather, they need to be expressed using
verbal descriptions (which are about the encoded text, not part of it)
or some formal notations or both.

> This is a very common problem in
> the transcription of historical texts where you have lacunas. Often, the
> extent of the lacuna is known, and the alphabet is known as well. The
> EEBO TCP transcriptions of English texts before 1700 are good examples.
> They are SGML transcriptions, where missing stuff is represented by
> <gap/> elements with attributes about this or that. This is efficient
> when it comes to pages, very inefficient when it comes to individual
> characters.

Efficient in what sense? Saving bytes can hardly be an issue here. And
if various attributes are needed to describe the case, then it would
become awkward to try to do the same with encoded characters (or
“characters”, Unicode code points).

> In the TCP project, various code points from the Geometric Shapes block
> were used to represent lacunae. The black circle (\u25cf) has been used
> as the character for a missing character. This is OK and unambiguous in
> its context.

If some graphic symbol is by convention used to represent a lacuna, then
the issue, as regards Unicode, is simply whether that symbol exists
as an encoded character or whether there is need to add that graphic
symbol to Unicode. But it would be a matter of encoding graphic
characters (irrespective of their meaning in some content), not about
encoding abstract ideas like “an unrecognized character”.
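
For illustration, a sketch in Python of the black circle convention mentioned above: it looks up the name of U+25CF with the standard unicodedata module and counts the marked gaps in a transcription. The sample string and the function name count_lacunae are invented for this example.

```python
import unicodedata

# U+25CF, the character the quoted message says was used for a missing character.
LACUNA = "\u25cf"

# Look up the character's formal Unicode name.
name = unicodedata.name(LACUNA)


def count_lacunae(text: str) -> int:
    """Count how many characters in the text are marked as missing."""
    return text.count(LACUNA)


# Invented sample: three unreadable letters.
sample = "th\u25cf q\u25cf\u25cfck fox"
print(name, count_lacunae(sample))
```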

> But it would be nice to have a special character for just that
> purpose

Various symbols are used in different contexts to indicate situations
like “there is a written symbol that cannot be recognized as a specific
character”. Perhaps there should be a universal convention about this,
but it is unrealistic to expect that to happen. The Unicode Standard can
hardly standardize such things. And if there were such a universal
symbol, it would surely have been encoded in Unicode—not because of its
meaning, but because of its consistent use as a character in plain text.

So I think the conclusion is that you should use established
conventions, if they exist, about using some symbol for such situations,
or define a convention as needed. You should not expect the character to
be recognized in this special meaning without such a higher-level
convention.

There’s a theoretical (?) problem with this. Let us assume that you
decide to use a particular character to represent “unknown character” in
your data, when working with some type of written texts. What happens
when you encounter, in the study of those texts, a graphic symbol that is
best identified as the character you decided to use in that special
meaning? Well, I think you can decide to solve that problem if it ever
appears.

Yucca
Received on Mon Dec 26 2016 - 05:31:54 CST
