Re: a character for an unknown character from Philippe Verdy on 2016-12-23 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sat, 24 Dec 2016 00:53:35 +0100

I would not bet at all that 17th century texts do not have any bullets,
even if those are not yet encoded in Unicode. In fact there are MANY
bullet-like symbols in old manuscripts, and OCR could also "find" many of
them, by not being able to distinguish various dots. In fact I've seen
bullet-like symbols used to mean zero, or as placeholders meaning "N/A".

17th century manuscripts and books are full of various decorative glyphs.
And the bullets are so easy to confuse... Take old Occitan manuscripts:
you'll find them in various places. Consider Asian scripts, you could
confuse them with full stops, or other diacritics. Consider Arabic or
Hebrew and you'll probably confuse them with vowel points.

So if you really want to include a placeholder for damaged/missing parts in
old documents, the large geometric shapes are probably best to use: they
cannot be confused with other dots, and readers will immediately know
that this is a placeholder for some missing/destroyed content.

A commonly used alternative is "[...]", a convention often
used in citations when some parts of the sentence are deliberately omitted
by the editor.

2016-12-23 22:19 GMT+01:00 Martin Mueller <martinmueller_at_northwestern.edu>:

> That’s excellent advice. In our project somebody confused the bullet with
> the black circle. It didn’t matter, because 17th century texts don’t have
> bullet symbols—at least not the ones we’re dealing with. But following your
> advice would significantly reduce ambiguity
>
>
>
> *From: *<verdyp_at_gmail.com> on behalf of Philippe Verdy <verdy_p_at_wanadoo.fr>
> *Reply-To: *Philippe Verdy <verdy_p_at_wanadoo.fr>
> *Date: *Friday, December 23, 2016 at 1:35 PM
> *To: *Martin Mueller <martinmueller_at_northwestern.edu>
> *Cc: *William_J_G Overington <wjgo_10009_at_btinternet.com>,
> "unicode_at_unicode.org" <unicode_at_unicode.org>
> *Subject: *Re: a character for an unknown character
>
>
>
> if you want something that is very unlikely to be present in original
> texts, it would be preferable to avoid the black dot or any other bullets
> which may be used as punctuation marks.
>
>
>
> Consider using some geometric shape, notably those inherited from DOS code
> pages, such as the filled square U+2588 (█). It is mapped in many common
> fonts, if only because it is part of legacy code page 437 (at position
> 0xDB=219 decimal) and most other code pages for MS-DOS. It may occur in
> legacy-encoded MS-DOS texts, but only for presentation purposes (with
> monospaced fonts on text-only terminals), so it should not clash with any
> use for missing/damaged parts of an original printed or handwritten
> document on paper (those DOS texts should have no original version on
> paper; they originate only as encoded files on computers).
>
>
>
> It is easily entered on keyboards using Alt+219 (**not** Alt+0219) on
> Windows (it works using the current OEM 8-bit codepage, which may be CP437,
> CP850 or similar).
>
>
>
> There's also the half-filled square U+2584 (▄), at position 0xDC=220
> decimal in CP437/CP850 (i.e. Alt+220 on Windows keyboards), if you want to
> avoid filling the full line height and to be able to distinguish multiple
> rows of text.
>
>
>
> Or the filled square with a dark grey pattern U+2593 (▓), at position
> 0xB2=178 (i.e. Alt+178 on Windows keyboards) if you want to still see it
> with text selection. Its grey pattern also intuitively suggests a "missing
> part".
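The code page positions quoted above can be cross-checked with a short Python sketch. It assumes only the standard library's `cp437` and `cp850` codecs; the helper name `describe` is made up for this illustration and is not something from the thread.

```python
import unicodedata

# OEM byte values for the three shapes discussed above; the decimal value
# is also the number typed after Alt on a Windows keyboard.
CANDIDATES = [0xDB, 0xDC, 0xB2]


def describe(byte_value, codec="cp437"):
    """Return (decimal position, code point, Unicode name) for one OEM byte."""
    char = bytes([byte_value]).decode(codec)
    return byte_value, f"U+{ord(char):04X}", unicodedata.name(char)


for codec in ("cp437", "cp850"):
    for value in CANDIDATES:
        position, code_point, name = describe(value, codec)
        print(f"{codec}: 0x{value:02X} = {position} -> {code_point} {name}")
```

Decoding through the codec rather than hard-coding the table makes it easy to repeat the check for any other OEM code page a contributor might have active.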
>
>
>
> All these geometric shapes are symbols, not punctuation, so they are very
> unlikely to be used as bullet punctuation in documents and are not confusable
> with any other characters of actual text. They are also ignored in plain
> text searches, i.e. not considered as variants of a significant dot, and
> there's also a word break before and after them (so they won't collapse
> into surrounding words written before or after them). They are also
> typically used to replace words that have been deliberately deleted/hidden
> from an original document (because there's a need to keep this information
> private).
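The symbol-versus-punctuation distinction can be inspected through the Unicode general category, as in the sketch below. It uses Python's standard `unicodedata` and `re` modules; the helper names are illustrative. The last line only shows how Python's own `\w` treats a block character, which is an assumption standing in for real search or word-break behaviour and is not a test of any particular search engine or of the Unicode text segmentation rules.

```python
import re
import unicodedata

# Characters mentioned in this thread, labelled informally.
CANDIDATES = {
    "full block": "\u2588",
    "lower half block": "\u2584",
    "dark shade": "\u2593",
    "black circle": "\u25cf",
    "bullet": "\u2022",
    "asterisk": "*",
}


def is_symbol(char):
    """True if the general category is one of the S* (symbol) categories."""
    return unicodedata.category(char).startswith("S")


def is_punctuation(char):
    """True if the general category is one of the P* (punctuation) categories."""
    return unicodedata.category(char).startswith("P")


for label, char in CANDIDATES.items():
    print(f"U+{ord(char):04X} {label}: {unicodedata.category(char)}")

# Python's \w does not match symbol characters, so the block splits the run.
print(re.findall(r"\w+", "wo\u2588rd"))
```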
>
>
>
> But note that input fields for entering passwords or secret codes in
> application forms/dialogs typically use black bullets U+2022 (•) or
> simply ASCII asterisks U+002A (*) to replace the entered characters: they
> cannot be read, but the user knows what he is entering on his keyboard.
>
> 2016-12-23 0:35 GMT+01:00 Martin Mueller <martinmueller_at_northwestern.edu>:
>
> These are very handsome and interesting. But for the purposes of my
> project, which involves folks here, there, and everywhere working on
> editorial problems relating to digital transcriptions of Early Modern
> texts, the cardinal requirement is that the character can be found on and
> deployed from any Windows, Linux, or OS X machine. We have used the black
> dot (\u25cf) as a kludge. Since it does not occur in the source data, there
> is no ambiguity. It is relatively easy to produce on a keyboard. From a
> visual perspective it is preferable to the diamond with a question
> mark—although that is semantically more obvious. But it is visually very
> disruptive, and it is much harder to find on a standard character map than
> the black dot, which is predictably located in geometrical shapes.
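The "no ambiguity" argument depends on the placeholder never appearing in the source data, which a project could check mechanically. The following is a minimal sketch under that assumption; the function names and sample strings are invented for illustration and do not come from the project described here.

```python
# Black circle, the placeholder described in the message above.
PLACEHOLDER = "\u25cf"


def placeholder_is_safe(source_text, placeholder=PLACEHOLDER):
    """True if the placeholder never occurs in the untranscribed source data."""
    return placeholder not in source_text


def count_gaps(transcription, placeholder=PLACEHOLDER):
    """Count the unknown characters marked in a transcription."""
    return transcription.count(placeholder)


if __name__ == "__main__":
    sample = "the k\u25cf\u25cfg of France"  # hypothetical transcription
    print(placeholder_is_safe("the king of France"))
    print(count_gaps(sample))
```

Running the safety check once over the whole corpus before adopting a placeholder would catch the kind of bullet/black-circle collision discussed in this thread.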
>
> It’s a kludge, but it works, and it looks to me superior to any of the
> alternatives. But I can be persuaded otherwise.
>
> With thanks for the help of all of you
>
> MM
>
> On 12/22/16, 6:03 AM, "William_J_G Overington" <wjgo_10009_at_btinternet.com>
> wrote:
>
> Martin Mueller wrote:
>
> > Is there a Unicode character that says “I represent an
> alphanumerical character, but I don’t know which”. This is a very common
> problem in the transcription of historical texts where you have lacunas.
>
> I have been reading this thread with interest.
>
> I have produced nine designs for glyphs.
>
> If you so choose, you can assign specific meanings to one, some, or
> all of them. If you need more than nine designs please say.
>
> Please find attached nine .png files, one glyph design in each file.
>
> The size of each of the images and the names of the files follow the
> following specification.
>
> http://www.unicode.org/emoji/selection.html#images
>
> However, the images do not conform exactly to those rules,
> as there is a one-pixel-wide transparent surround: the designs were made
> using filled rectangles upon a theoretical seven row by seven column
> arrangement of blocks, each block ten pixels by ten pixels. I used the
> Serif PagePlus X7 desktop publishing program.
>
> The characters are not intended as emoji, I just applied the above
> specification as it is convenient to make the designs compatible with that
> specification as far as possible.
>
> I have assigned Private Use Area code points of U+EA60 through to
> U+EA68 to the glyphs. The specific code point for each glyph is indicated
> in the file name of the image of that glyph.
>
> I have chosen those code points as the Alt codes for U+EA60 through to
> U+EA68 are Alt 60000 through to Alt 60008 respectively. My thinking is
> that, if the designs are implemented in fonts, those easy-to-remember
> Alt codes might be helpful to someone using the Microsoft WordPad program.
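The hexadecimal-to-decimal correspondence behind those Alt codes is plain arithmetic and can be confirmed with the sketch below, which also checks that each code point is in the Private Use Area (general category `Co`). The helper name `alt_code` is illustrative, and whether a given Windows program accepts decimal Alt codes of this size is taken from the message above rather than verified here.

```python
import unicodedata

FIRST, LAST = 0xEA60, 0xEA68  # the nine Private Use Area code points


def alt_code(code_point):
    """The decimal Alt code is simply the code point written in base 10."""
    return int(f"{code_point:04X}", 16)


for cp in range(FIRST, LAST + 1):
    category = unicodedata.category(chr(cp))  # "Co" means private use
    print(f"U+{cp:04X} -> Alt {alt_code(cp)} (category {category})")
```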
>
> I checked that those code points are not being used in the Medieval
> Unicode Font Initiative.
>
> http://skaldic.abdn.ac.uk/db.php?cp=EA&if=mufi&table=mufi_char
>
> Readers who so choose are welcome to implement these glyphs in fonts.
>
> The http://www.unicode.org/emoji/selection.html#images specification
> mentions licensing. For the avoidance of doubt these designs are free to
> share and use.
>
>
> A Private Use Area solution is not ideal, yet may be helpful in
> getting things started and could be helpful in establishing usage, which
> could help in getting the characters implemented into regular Unicode.
>
> I am attaching the images to this email. The nature of the email
> system is that the order of the images might not be in the order of the
> code points, yet each image has an indication of the code point within its
> name so that information should help to resolve any such problem in the
> transmission of the email attachments.
>
> William Overington
>
> Thursday 22 December 2016
>
Received on Fri Dec 23 2016 - 17:54:20 CST