PRI #316 Background: Proposal to Remove Some Hira/Kata From Script_Extensions

The Script_Extensions property values for some characters contain Hiragana, Katakana, or Bopomofo, when they should only contain Han. The UTC is considering removing the Hiragana, Katakana, or Bopomofo in these cases, and would like feedback as to any that should not be changed, and any others that should be.

Mistaken Script_Extensions values cause false positives in confusability detection and other processing. For example, they cause the following two characters to be considered whole-script confusables:

ー         U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

㇐         U+31D0 CJK STROKE H
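As a rough illustration of why this pair is flagged, the sketch below intersects the Script_Extensions sets of the two characters before and after the proposed change. It is a simplification, not the full whole-script confusable algorithm: it only asks whether both characters can be attributed to a common script. The values for U+31D0 come from the proposal text; the value for U+30FC (Hiragana, Katakana) is an assumption taken from the published Script_Extensions data, and the names SCX_CURRENT, SCX_PROPOSED, and shared_scripts are hypothetical.

```python
# Script_Extensions values as sets of script names.
# U+31D0: "from" and "to" values as given in this proposal.
# U+30FC: assumed to be {Hiragana, Katakana}; not stated in this document.
SCX_CURRENT = {
    0x30FC: {"Hiragana", "Katakana"},
    0x31D0: {"Bopomofo", "Han", "Hangul", "Hiragana", "Katakana"},
}
SCX_PROPOSED = {
    0x30FC: {"Hiragana", "Katakana"},
    0x31D0: {"Han", "Hangul"},
}


def shared_scripts(a, b, scx):
    """Return the scripts in which both code points are allowed to appear."""
    return scx[a] & scx[b]


current = shared_scripts(0x30FC, 0x31D0, SCX_CURRENT)
proposed = shared_scripts(0x30FC, 0x31D0, SCX_PROPOSED)

print(sorted(current))   # scripts shared under the current values
print(sorted(proposed))  # scripts shared under the proposed values
```

With the current values the two characters share Hiragana and Katakana, so a check of this kind can treat them as lookalikes within one script. With the proposed values the intersection is empty.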

Hiragana and Katakana are, of course, part of the Japanese writing system, which also uses Han. But the following characters are not part of the Hiragana or Katakana scripts, and should have those scripts removed from their Script_Extensions values. Similarly, Bopomofo should be removed where it appears below.

The list excludes characters that don’t contain ideographs or CJK strokes. However, it includes a few others that appear to be specifically for use with ideographs, like IDEOGRAPHIC ANNOTATION LINKING MARK or IDEOGRAPHIC VARIATION INDICATOR, and that don’t seem particularly likely to be interspersed with pure Hiragana or Katakana text. In review, please pay special attention to those characters.

The analysis also turned up 6 circled ideograph characters that have Script_Extensions=Common when they probably should have Script_Extensions=Han, so those are also included.

Proposed Changes

  1. For the following 196 characters, change the Script_Extensions value

from:        Bopomofo,Han,Hangul,Hiragana,Katakana

to:        Han,Hangul

303E ;        IDEOGRAPHIC VARIATION INDICATOR

303F ;        IDEOGRAPHIC HALF FILL SPACE

31C0..31E3 ;        CJK STROKE T
.. CJK STROKE Q

3220..3243 ;        PARENTHESIZED IDEOGRAPH ONE
.. PARENTHESIZED IDEOGRAPH REACH

3280..32B0 ;        CIRCLED IDEOGRAPH ONE
.. CIRCLED IDEOGRAPH NIGHT

32C0..32CB ;        IDEOGRAPHIC TELEGRAPH SYMBOL FOR JANUARY
.. IDEOGRAPHIC TELEGRAPH SYMBOL FOR DECEMBER

3358..3370 ;        IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR ZERO
.. IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR TWENTY-FOUR

337B..337F ;        SQUARE ERA NAME HEISEI .. SQUARE CORPORATION

33E0..33FE ;        IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY ONE
.. IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY THIRTY-ONE

  2. For the following 16 characters, change the Script_Extensions value

from:        Han,Hiragana,Katakana

to:        Han

3190..319F ;        IDEOGRAPHIC ANNOTATION LINKING MARK
.. IDEOGRAPHIC ANNOTATION MAN MARK

  3. For the following 6 characters, change the Script_Extensions value

from:        Common

to:        Han

3244..3247 ;        CIRCLED IDEOGRAPH QUESTION
.. CIRCLED IDEOGRAPH KOTO

1F250 ;        CIRCLED IDEOGRAPH ADVANTAGE

1F251 ;        CIRCLED IDEOGRAPH ACCEPT
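The three groups of changes above can be written down as data and checked mechanically. The following is a minimal sketch, assuming the ranges are transcribed correctly from the lists above; PROPOSED, count, and proposed_value are hypothetical names introduced only for this illustration.

```python
CJK_ALL = "Bopomofo,Han,Hangul,Hiragana,Katakana"

# (old value, new value, inclusive code point ranges), transcribed from
# the three groups of proposed changes.
PROPOSED = [
    (CJK_ALL, "Han,Hangul", [
        (0x303E, 0x303F), (0x31C0, 0x31E3), (0x3220, 0x3243),
        (0x3280, 0x32B0), (0x32C0, 0x32CB), (0x3358, 0x3370),
        (0x337B, 0x337F), (0x33E0, 0x33FE),
    ]),
    ("Han,Hiragana,Katakana", "Han", [(0x3190, 0x319F)]),
    ("Common", "Han", [(0x3244, 0x3247), (0x1F250, 0x1F251)]),
]


def count(ranges):
    """Number of code points covered by a list of inclusive ranges."""
    return sum(last - first + 1 for first, last in ranges)


def proposed_value(cp):
    """Proposed Script_Extensions value for cp, or None if cp is unaffected."""
    for _old, new, ranges in PROPOSED:
        if any(first <= cp <= last for first, last in ranges):
            return new
    return None


group_sizes = [count(ranges) for _old, _new, ranges in PROPOSED]
print(group_sizes, sum(group_sizes))
```

The group sizes should match the counts stated in the three items (196, 16, and 6), which gives a quick way to catch a mistyped range boundary during review.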

The full set of characters that would be affected is:

[〾〿㆐-㆟㇀-㇣㈠-㉇㊀-㊰㋀-㋋㍘-㍰ ㍻-㍿㏠-㏾🉐🉑]
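For reviewers who want to test individual code points against this set, the sketch below expands a flat set pattern of this shape into code points. It is not a full UnicodeSet parser: it assumes only literal characters, a-b ranges, and ignorable spaces, and the function name expand_pattern is hypothetical. The pattern is the same set as above, written with escapes so that the code points are easy to audit.

```python
def expand_pattern(pattern):
    """Expand a flat set pattern such as "[ab-d]" into a set of code points.

    Handles only literal characters, "a-b" ranges, and ignorable spaces.
    Escapes inside the pattern, nested sets, and property expressions are
    out of scope for this sketch.
    """
    if not (pattern.startswith("[") and pattern.endswith("]")):
        raise ValueError("pattern must be enclosed in [ ]")
    chars = [c for c in pattern[1:-1] if c != " "]
    result = set()
    i = 0
    while i < len(chars):
        first = ord(chars[i])
        if i + 2 < len(chars) and chars[i + 1] == "-":
            last = ord(chars[i + 2])
            if last < first:
                raise ValueError("range end precedes range start")
            result.update(range(first, last + 1))
            i += 3
        else:
            result.add(first)
            i += 1
    return result


# The affected set from the text, spelled with Python string escapes.
AFFECTED_PATTERN = (
    "[\u303E\u303F\u3190-\u319F\u31C0-\u31E3\u3220-\u3247\u3280-\u32B0"
    "\u32C0-\u32CB\u3358-\u3370 \u337B-\u337F\u33E0-\u33FE"
    "\U0001F250\U0001F251]"
)

affected = expand_pattern(AFFECTED_PATTERN)
print(len(affected))
print(0x31D0 in affected, 0x30FC in affected)
```

The size of the expanded set should equal the sum of the three groups in the proposed changes (196 + 16 + 6).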

Comparison

For comparison, the following list includes other characters whose Script_Extensions values contain Han and Hiragana, Katakana, or Bopomofo. These are not currently part of the proposal, but we’d like feedback as to whether any should be.

Script_Extensions=Bopomofo,Han        items: 4

302A..302D ;        IDEOGRAPHIC LEVEL TONE MARK
.. IDEOGRAPHIC ENTERING TONE MARK        // GC=NSM

Script_Extensions=Bopomofo,Han,Hangul,Hiragana,Katakana        items: 10

3003 ;        DITTO MARK

3013 ;        GETA MARK        // GC=Other_Symbol

301C..301F ;        WAVE DASH
.. LOW DOUBLE PRIME QUOTATION MARK

3030 ;        WAVY DASH

3037 ;        IDEOGRAPHIC TELEGRAPH LINE FEED SEPARATOR SYMBOL        // GC=Other_Symbol

FE45 ;        SESAME DOT

FE46 ;        WHITE SESAME DOT

Script_Extensions=Bopomofo,Han,Hangul,Hiragana,Katakana,Yi        items: 26

3001 ;        IDEOGRAPHIC COMMA

3002 ;        IDEOGRAPHIC FULL STOP

3008..3011 ;        LEFT ANGLE BRACKET
.. RIGHT BLACK LENTICULAR BRACKET

3014..301B ;        LEFT TORTOISE SHELL BRACKET
.. RIGHT WHITE SQUARE BRACKET

30FB ;        KATAKANA MIDDLE DOT

FF61..FF65 ;        HALFWIDTH IDEOGRAPHIC FULL STOP
.. HALFWIDTH KATAKANA MIDDLE DOT

Script_Extensions=Han,Hiragana,Katakana        items: 3

3006 ;        IDEOGRAPHIC CLOSING MARK        // GC=Other_Letter

303C ;        MASU MARK        // GC=Other_Letter

303D ;        PART ALTERNATION MARK
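The item counts in the headings above can be re-derived from the listed ranges. The sketch below tallies each group and also checks that no code point is listed under two different values. The ranges are transcribed from the lists above, and COMPARISON and expand are hypothetical names used only here.

```python
# Script_Extensions value -> inclusive code point ranges, transcribed
# from the comparison lists.
COMPARISON = {
    "Bopomofo,Han": [(0x302A, 0x302D)],
    "Bopomofo,Han,Hangul,Hiragana,Katakana": [
        (0x3003, 0x3003), (0x3013, 0x3013), (0x301C, 0x301F),
        (0x3030, 0x3030), (0x3037, 0x3037), (0xFE45, 0xFE46),
    ],
    "Bopomofo,Han,Hangul,Hiragana,Katakana,Yi": [
        (0x3001, 0x3002), (0x3008, 0x3011), (0x3014, 0x301B),
        (0x30FB, 0x30FB), (0xFF61, 0xFF65),
    ],
    "Han,Hiragana,Katakana": [(0x3006, 0x3006), (0x303C, 0x303D)],
}


def expand(ranges):
    """List every code point in a list of inclusive ranges."""
    return [cp for first, last in ranges for cp in range(first, last + 1)]


items = {value: len(expand(ranges)) for value, ranges in COMPARISON.items()}
all_code_points = [cp for ranges in COMPARISON.values() for cp in expand(ranges)]

# A code point listed under two values would make the list longer than the set.
has_duplicates = len(all_code_points) != len(set(all_code_points))

for value, n in items.items():
    print(f"{value}: {n}")
print(len(all_code_points), has_duplicates)
```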

The full set of 43 comparison characters is:

[〪-〭 〜 ・･ 、､ ﹅ ﹆ 。｡ 〝-〟 〈-「｢ 」｣ 『-】 〔-〛 〃 〽 〰 〓 〷 〼 〆]