Talk:Unicode equivalence
WikiProject Typography: C-class, Mid-importance | WikiProject Computing: C-class, Mid-importance
Technical tone
The tone of this article is really technical. babbage (talk) 04:23, 4 October 2009 (UTC)
Useful link
Is it OK to add a link to some software that I found useful? It's called charlint, and it's a Perl script that can be used for normalisation. It can be found at http://www.w3.org/International/charlint/ Wrecktaste (talk) 15:54, 21 June 2010 (UTC)
Redirect
Glyph Composition / Decomposition redirects here, but the term glyph is not used in this article. — Christoph Päper 15:50, 27 August 2010 (UTC)
Subset
Mathematically speaking, the compatible forms are subsets of the canonical ones. But that sentence is a bit confusing and should probably be rewritten. 213.100.90.101 (talk) 16:36, 11 March 2011 (UTC)
- Then please do so. I prefer a readable Unicode description. -DePiep (talk) 22:46, 11 March 2011 (UTC)
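For the record, the relation runs the other way: canonical equivalence is the stricter of the two relations, so every canonically equivalent pair of strings is also compatibility-equivalent, but not vice versa. A minimal sketch using Python's standard unicodedata module (the example characters are just illustrative):

```python
import unicodedata

# Composed "Å" (U+00C5) vs. decomposed "A" + COMBINING RING ABOVE (U+030A):
# canonically equivalent, so they match under every normal form.
composed = "\u00C5"
decomposed = "A\u030A"
assert unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
assert unicodedata.normalize("NFKC", composed) == unicodedata.normalize("NFKC", decomposed)

# LATIN SMALL LIGATURE FI (U+FB01) vs. the two letters "fi": compatibility-
# equivalent only, so they match under NFKC/NFKD but stay distinct under NFC/NFD.
ligature = "\uFB01"
letters = "fi"
assert unicodedata.normalize("NFC", ligature) != unicodedata.normalize("NFC", letters)
assert unicodedata.normalize("NFKC", ligature) == unicodedata.normalize("NFKC", letters)
```

In other words, NFC/NFD only unify the canonical duplicates, while NFKC/NFKD additionally fold the compatibility variants.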
Rationale for equivalence
The following rationale was offered for why Unicode introduced the concept of equivalence:
- it was desirable that two different strings in an existing encoding would translate to two different strings in Unicode; therefore, if any popular encoding had two ways of encoding the same character, Unicode needed two as well.
AFAIK, this is only part of the story. The main problem (duplicated characters and composed/decomposed ambiguity) was not inherited from any single prior standard, but arose from merging multiple standards with overlapping character sets.
One of the reasons was the desire to incorporate several preexisting character sets while preserving their encodings as much as possible, to simplify the migration to Unicode. Thus, for example, the ISO Latin-1 set is included verbatim in the first 256 code positions, and several other national standards (Russian, Greek, Arabic, etc.) were included as well. Some attempt was made to eliminate duplication; for example, European punctuation is encoded only once (mostly in the Latin-1 segment). Still, some duplicates remained, such as the ANGSTROM SIGN (originating from a set of miscellaneous symbols) and the LETTER A WITH RING ABOVE (from Latin-1).

Another reason was the necessary inclusion of combining diacritics: first, to allow for every potentially useful letter-accent combination (such as the umlaut-n used by a certain rock band) without wasting an astronomical number of code points, and, second, because several preexisting standards used the decomposed form to represent accented letters.

Yet another reason was to preserve traditional encoding distinctions between typographic forms of certain letters: for example, the superscript digits of Latin-1, the ligatures of PostScript, Arabic, and other typographically oriented sets, and the circled digits, half-width katakana, and double-width Latin letters that had their own codes in standard Japanese character sets.
All these features meant that Unicode would allow multiple encodings for identical or very similar characters, to a much greater degree than any previous standard; this negated the main advantage of having a standard and made text search a nightmare. Hence the need for the standard normal forms. Canonical equivalence was introduced to cope with the first two sources of ambiguity above, while compatibility equivalence was meant to address the last one. Jorge Stolfi (talk) 14:49, 16 June 2011 (UTC)
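A short sketch, again using Python's standard unicodedata module, illustrating both kinds of duplication described above (the chosen code points are just examples):

```python
import unicodedata

# Duplicate characters inherited from merged legacy sets: ANGSTROM SIGN
# (U+212B) canonically decomposes to LETTER A WITH RING ABOVE (U+00C5),
# so every normal form unifies the two.
print(unicodedata.normalize("NFC", "\u212B") == "\u00C5")   # True

# Typographic variants kept for round-trip fidelity are only *compatibility*
# equivalent: NFC preserves them, while NFKC folds them to their plain forms.
for ch in ("\u00B2",    # SUPERSCRIPT TWO          -> "2"
           "\uFF76",    # HALFWIDTH KATAKANA KA    -> ordinary katakana KA (U+30AB)
           "\uFF21"):   # FULLWIDTH LATIN A        -> "A"
    print(ch, unicodedata.normalize("NFC", ch), unicodedata.normalize("NFKC", ch))
```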