
Talk:Unicode equivalence

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Jorge Stolfi (talk | contribs) at 14:52, 16 June 2011 (Rational for equivalence: typos). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
WikiProject Typography: C-class, Mid-importance
This article is within the scope of WikiProject Typography, a collaborative effort to improve the coverage of articles related to Typography on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
WikiProject Computing: C-class, Mid-importance
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.

Technical tone

The tone of this article is really technical. babbage (talk) 04:23, 4 October 2009 (UTC)

Is it OK to add a link to some software that I found useful? It's called charlint, and it's a Perl script that can be used for normalisation. It can be found at http://www.w3.org/International/charlint/ Wrecktaste (talk) 15:54, 21 June 2010 (UTC)

Redirect

Glyph Composition / Decomposition redirects here, but the term glyph is not used in this article. — Christoph Päper 15:50, 27 August 2010 (UTC)

Subset

Mathematically speaking, the compatible forms are subsets of the canonical ones. But that sentence is a bit confusing and should probably be rewritten. 213.100.90.101 (talk) 16:36, 11 March 2011 (UTC)

Then please do so. I prefer a readable Unicode description. -DePiep (talk) 22:46, 11 March 2011 (UTC)

Rationale for equivalence

The following rationale was offered for why Unicode introduced the concept of equivalence:

it was desirable that two different strings in an existing encoding would translate to two different strings when translated to Unicode, therefore if any popular encoding had two ways of encoding the same character, Unicode needed to as well.

AFAIK, this is only part of the story. The main problem (duplicated chars and composed/decomposed ambiguity) was not inherited from any single prior standard, but from the merging of multiple standards with overlapping character sets.
One of the reasons was the desire to incorporate several preexisting character sets while preserving their encoding as much as possible, to simplify the migration to Unicode. Thus, for example, the ISO Latin-1 set is included unchanged in the first 256 code positions, and several other national standards (Russian, Greek, Arabic, etc.) were included as well. Some attempt was made to eliminate duplication; so, for example, European punctuation is encoded only once (mostly in the Latin-1 segment). Still, some duplicates remained, such as the ANGSTROM SIGN (originating from a set of miscellaneous symbols) and the LATIN CAPITAL LETTER A WITH RING ABOVE (from Latin-1).
Another reason was the necessary inclusion of combining diacritics: first, to allow for all possibly useful letter-accent combinations (such as the umlaut-n used by a certain rock band) without wasting an astronomical number of code points, and, second, because several preexisting standards used the decomposed form to represent accented letters.
Yet another reason was to preserve traditional encoding distinctions between typographic forms of certain letters, for example the superscript and subscript digits (the superscripts 1, 2, and 3 come from Latin-1), the ligatures of PostScript, Arabic, and other typographically oriented sets, and the circled digits, half-width katakana, and double-width (full-width) Latin letters, which had their own codes in standard Japanese charsets.
All these features meant that Unicode would allow multiple encodings for identical or very similar characters, to a much greater degree than any previous standard, thus negating the main advantage of a standard and making text search a nightmare. Hence the need for the standard normal forms. Canonical equivalence was introduced to cope with the first two sources of ambiguity above (duplicated characters and composed/decomposed forms), while compatibility equivalence was meant to address the last one (typographic variants). Jorge Stolfi (talk) 14:49, 16 June 2011 (UTC)
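
For anyone who wants to check the three cases above, here is a minimal sketch using Python's standard `unicodedata` module. The helper names `canonically_equivalent` and `compatibility_equivalent` are made up for this illustration; they simply compare the NFD and NFKD normal forms, which is one reasonable way to test the two kinds of equivalence, not necessarily how any particular library does it.

```python
import unicodedata

def canonically_equivalent(s, t):
    # Hypothetical helper: two strings are treated as canonically
    # equivalent if their NFD (canonical decomposition) forms match.
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)

def compatibility_equivalent(s, t):
    # Hypothetical helper: same idea, using NFKD (compatibility decomposition).
    return unicodedata.normalize("NFKD", s) == unicodedata.normalize("NFKD", t)

# Source 1: duplicated characters.
angstrom_sign = "\u212B"       # ANGSTROM SIGN
a_ring = "\u00C5"              # LATIN CAPITAL LETTER A WITH RING ABOVE
assert angstrom_sign != a_ring                      # different code points
assert canonically_equivalent(angstrom_sign, a_ring)

# Source 2: precomposed letters versus base letter + combining diacritic.
a_ring_decomposed = "A\u030A"  # A + COMBINING RING ABOVE
assert canonically_equivalent(a_ring, a_ring_decomposed)
# The umlaut-n has no precomposed code point, so NFC keeps two code points.
assert len(unicodedata.normalize("NFC", "n\u0308")) == 2

# Source 3: typographic variants, equivalent only under compatibility.
variants = {
    "\u00B2": "2",        # SUPERSCRIPT TWO (from Latin-1)
    "\uFB01": "fi",       # LATIN SMALL LIGATURE FI
    "\u2460": "1",        # CIRCLED DIGIT ONE
    "\uFF76": "\u30AB",   # HALFWIDTH KATAKANA LETTER KA vs. KATAKANA LETTER KA
    "\uFF21": "A",        # FULLWIDTH LATIN CAPITAL LETTER A
}
for variant, plain in variants.items():
    assert not canonically_equivalent(variant, plain)
    assert compatibility_equivalent(variant, plain)
```

A text search that normalizes both the query and the text to the same form (NFD or NFC for strict matching, NFKD or NFKC for looser matching) would therefore treat these variants as the same, which is the practical point of the normal forms.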