To: UTC
	From: Mark Davis, Peter Edberg, John Jenkins 
	[tbd]
	Subject: Additions to Unihan needed for CLDR.
	Date: 2011-01-28
	 
	Certain fields in the Unihan data are of major importance for
	internationalization library implementations, notably the total strokes
	and the pinyin readings, which are needed as a basis for collation and
	other services. Unfortunately, the current data fields are not well suited
	to use in implementations and produce many results that do not match
	common user expectations. This conclusion is based on bug reports from
	the field and on review by native speakers.
	 
	The following presents a proposal from the 
	CLDR committee for improving the Unihan data by adding new fields and 
	changing the contents of some fields.
	 
	
		- Define the kMandarin field to contain the most customary pinyin
		reading for the character. When there are two values, the first is
		preferred for zh-Hans (CN) and the second is preferred for zh-Hant
		(TW). If the two values would be the same, there is only one value.
			- The preferred value is the one most commonly used in modern
			text, with some preference given to readings most likely to be
			in sorted lists.
			- This redefinition of kMandarin can be done because the
			kMandarin field never had a specific definition in terms of
			other standards or works.
		- Define the kTotalStrokes field to be what is most appropriate for
		use with zh-Hant, and add a new field, kTotalSimplifiedStrokes, to be
		what is most appropriate for use with zh-Hans (CN). There are thus two
		different fields for the two different domains.
			- For a given character, the stroke count within China is
			fairly standardized, but the order and number of strokes may
			differ notably between China and the rest of the Chinese world.
			- The preferred value for each field is the one most commonly
			associated with the character in modern text using customary
			fonts, within that domain.
			- The kTotalStrokes field was defined to be the value "for the
			character as drawn in the Unicode charts". That definition is no
			longer relevant or correct with multi-glyph charts, and the
			field can thus be meaningfully repurposed.
		- Communicate to WG2 and the IRG the importance of this information,
		and the need to supply it for all newly encoded Han characters.
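
The proposed kMandarin convention (one value shared by both domains, or two values ordered zh-Hans then zh-Hant) could be consumed roughly as follows. This is only a minimal sketch in Python: the function name preferred_mandarin is hypothetical, the values are assumed to be space-separated, and the locale is assumed to carry an explicit script subtag (zh-Hans or zh-Hant).

```python
def preferred_mandarin(k_mandarin: str, locale: str) -> str:
    """Pick the preferred reading from a kMandarin value for a locale.

    Assumes the proposed convention: a single value applies to both
    domains; with two values, the first is for zh-Hans (CN) and the
    second is for zh-Hant (TW). Space separation is an assumption.
    """
    readings = k_mandarin.split()
    if not readings:
        raise ValueError("empty kMandarin value")
    # Only zh-Hant locales use the second value, and only when it exists.
    if locale.startswith("zh-Hant") and len(readings) > 1:
        return readings[1]
    return readings[0]
```

With an illustrative (not authoritative) value such as "xuè xiě", the sketch would select "xuè" for zh-Hans and "xiě" for zh-Hant; with a single value, both locales get the same reading.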
 
	The CLDR committee can provide initial data 
	for these fields based on a review and comparison against other sources such 
	as bihua and CNS. The data can then be improved over time.
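
For the two stroke-count fields proposed above, an implementation would choose the field by domain. The following Python sketch illustrates one possible lookup. The function name total_strokes and the dictionary layout are hypothetical, and the fallback to kTotalStrokes when kTotalSimplifiedStrokes is absent is an assumption of this sketch, not part of the proposal.

```python
from typing import Dict, Optional


def total_strokes(fields: Dict[str, str], locale: str) -> Optional[int]:
    """Return the stroke count appropriate for a locale, if available.

    `fields` maps Unihan field names to their string values for one
    character (a hypothetical in-memory layout).
    """
    if locale.startswith("zh-Hans"):
        # Prefer the proposed simplified field; falling back to
        # kTotalStrokes is an assumption made for this sketch.
        value = fields.get("kTotalSimplifiedStrokes") or fields.get("kTotalStrokes")
    else:
        value = fields.get("kTotalStrokes")
    return int(value) if value else None
```

For a made-up record such as {"kTotalStrokes": "12", "kTotalSimplifiedStrokes": "11"}, the sketch would yield 11 for zh-Hans and 12 for zh-Hant.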