Re: Normalization rate on the Web

From: Bjoern Hoehrmann
Date: Mon, 21 Jan 2013 21:51:36 +0100

* Denis Jacquerye wrote:
>Does anybody have any idea of how much of the Web is normalized in NFC
>or NFD? Or how much not normalized?
>How would one find out or try to make a smart guess?

"How much" is not a good question here. Let's say there are only two web
pages: one is very short and used by only one person once a year, the
other page is very long and used by one billion people once per day. If
one of the two pages is in NFC, the other in NFD, it would be misleading
to say that "50% of the web" is in NFD. More realistically, let's say
Wikipedia articles are all neither NFC nor NFD, but Google search result
pages are all NFD. How much would that be?
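
To make the two-page example concrete, here is a minimal sketch in Python. The names (Page, share_by_pages, share_by_views) are made up for illustration, and the view counts are just the ones from the thought experiment above, not measurements. Which page is NFD is also an arbitrary choice.

```python
from dataclasses import dataclass


@dataclass
class Page:
    name: str
    form: str            # "NFC", "NFD", or "neither" (assumed labels)
    views_per_year: int  # hypothetical usage figure


def share_by_pages(pages, form):
    """Fraction of pages in the given form, every page counted once."""
    return sum(p.form == form for p in pages) / len(pages)


def share_by_views(pages, form):
    """Fraction of page views that hit a page in the given form."""
    total = sum(p.views_per_year for p in pages)
    return sum(p.views_per_year for p in pages if p.form == form) / total


# The two pages from the thought experiment: one person once a year
# versus one billion people once per day.
pages = [
    Page("short page", "NFD", 1),
    Page("long page", "NFC", 1_000_000_000 * 365),
]

print(share_by_pages(pages, "NFD"))
print(share_by_views(pages, "NFD"))
```

Counted per page, half of this tiny web is NFD; counted per view, practically none of it is. Weighting by bytes or by characters would give yet another answer, so any figure needs to say what it is a fraction of.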

That problem aside, you could always get a list of web sites or pages
and download them, or use a pre-packaged dataset, and analyse the result.
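
The analysis step could look roughly like the following sketch, which uses only the unicodedata module from the Python standard library. The helper names classify and tally are hypothetical. Fetching the pages and decoding them to text is left out and assumed to have happened already. Note that pure ASCII text is both NFC and NFD at the same time, so a two-way split is not enough.

```python
import unicodedata
from collections import Counter


def classify(text):
    """Return "both", "NFC", "NFD", or "neither" for an already decoded string."""
    is_nfc = unicodedata.normalize("NFC", text) == text
    is_nfd = unicodedata.normalize("NFD", text) == text
    if is_nfc and is_nfd:
        return "both"      # e.g. plain ASCII, nothing to compose or decompose
    if is_nfc:
        return "NFC"
    if is_nfd:
        return "NFD"
    return "neither"       # e.g. a mix of composed and decomposed characters


def tally(texts):
    """Count how many of the given documents fall into each bucket."""
    return Counter(classify(t) for t in texts)


# Tiny stand-in for a downloaded dataset (made-up sample strings).
samples = [
    "plain ascii",
    "Bj\u00f6rn",           # precomposed o-umlaut
    "Bjo\u0308rn",          # o followed by combining diaeresis
    "\u00f6 and o\u0308",   # both spellings in one document
]
print(tally(samples))
```

This counts documents, one vote each. Per the caveat above, you would probably also want to record sizes or traffic figures alongside, so the counts can be weighted afterwards.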

My personal experience is that non-NFC content in German or English is
fairly rare; I can tell fairly easily because my smartphone cannot render
various characters like German umlauts properly when decomposed, so
I encounter that problem sometimes, mainly on sites that quote heavily
from PDFs and similar content.

Björn Höhrmann
Am Badedeich 7 · Telefon: +49(0)160/4415681
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78
Received on Mon Jan 21 2013 - 14:54:30 CST
