Talk:Unicode/Archive 7
This is an archive of past discussions about Unicode. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
Number of issues.
I have just edited the Issues section to include the number of identified "issues" with characters (code points): by my count there are 94 of them in the April 2017 document cited. I am copying and pasting them here from that document (with minor editing for brevity), as it may be helpful.
- U+0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE: not usually considered a single letter.
- U+01A2 LATIN CAPITAL LETTER OI: LATIN CAPITAL LETTER GHA, not OI
- U+01A3 LATIN SMALL LETTER OI: LATIN SMALL LETTER GHA, not oi
- U+01BE LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE: ligation of "ts"; not an inverted glottal stop
- U+0238 LATIN SMALL LETTER DB DIGRAPH: ligature, not a digraph
- U+0239 LATIN SMALL LETTER QP DIGRAPH: ligature, not a digraph
- U+025B LATIN SMALL LETTER OPEN E: Latin small letter epsilon [ idk if it is "open" or "closed" see U+025E]
- U+025E LATIN SMALL LETTER CLOSED REVERSED OPEN E: Latin small letter closed reversed epsilon (reversed form of U+025B).
- U+0285 LATIN SMALL LETTER SQUAT REVERSED ESH: reversed fishhook r with retroflex hook.
- U+02C7 CARON: hacek
- U+030C COMBINING CARON: combining hacek
- U+034F COMBINING GRAPHEME JOINER: incorrect description of function; it does not join graphemes
- U+039B GREEK CAPITAL LETTER LAMDA: preferably, but not necessarily, GREEK CAPITAL LETTER LAMBDA
- U+03BB GREEK SMALL LETTER LAMDA: preferably, but not necessarily, GREEK SMALL LETTER LAMBDA
- U+04A5 CYRILLIC SMALL LIGATURE EN GHE: not a decomposable ligature
- U+04B5 CYRILLIC SMALL LIGATURE TE TSE: not a decomposable ligature
- U+04D5 CYRILLIC SMALL LIGATURE A IE: not a decomposable ligature
- U+0598 HEBREW ACCENT ZARQA: Misleading, probably should have been called Hebrew accent tsinnorit
- U+05AE HEBREW ACCENT ZINOR: Should have been called Hebrew accent zarqa (= tsinor)
- U+0670 ARABIC LETTER SUPERSCRIPT ALEF: Not an Arabic letter, but a vowel sign.
- U+06C0 ARABIC LETTER HEH WITH YEH ABOVE: not a letter but a ligature
- U+06C2 ARABIC LETTER HEH GOAL WITH HAMZA ABOVE: not a letter but a ligature
- U+06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE: not a letter but a ligature
- U+0709 SYRIAC SUBLINEAR COLON SKEWED RIGHT: SYRIAC SUBLINEAR COLON SKEWED LEFT
- U+0964 DEVANAGARI DANDA: Despite the fact that these characters have "DEVANAGARI" in their names, these punctuation marks are intended for common use for the scripts of India.
- U+0965 DEVANAGARI DOUBLE DANDA: Despite the fact that these characters have "DEVANAGARI" in their names, these punctuation marks are intended for common use for the scripts of India.
- U+0A01 GURMUKHI SIGN ADAK BINDI: GURMUKHI SIGN ADDAK BINDI
- U+0B83 TAMIL SIGN VISARGA: This character is actually the aaytham, and is not used as a visarga in Tamil.
- U+0CDE KANNADA LETTER FA: There is no Kannada letter 'fa', this character represents the syllable 'llla'. A formal alias correcting this error has been defined.
- U+0E9D LAO LETTER FO TAM: The name for this character should have been fo sung, but that name is already used for U+0E9F. A formal alias LAO LETTER FO FON correcting this error has been defined.
- U+0E9F LAO LETTER FO SUNG: The name for this character should have been fo tam, but that name is already used for U+0E9D. A formal alias LAO LETTER FO FAY correcting this error has been defined.
- U+0EA3 LAO LETTER LO LING: The name for this character should have been lo loot, but that name is already used for U+0EA5. A formal alias LAO LETTER RO correcting this error has been defined.
- U+0EA5 LAO LETTER LO LOOT: The name for this character should have been lo ling, but that name is already used for U+0EA3. A formal alias LAO LETTER LO correcting this error has been defined.
- U+0F0A TIBETAN MARK BKA- SHOG YIG MGO: This character is used to indicate that a document is addressed to a superior (the "petition honorific"), but the Tibetan name actually indicates a superior addressing an inferior ("starting flourish for giving a command").
- U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG: The tsheg mark is not restricted to intersyllabic usage, and would have been better named Tibetan mark tsheg.
- U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR: This character is not a delimiter, but is a non-breaking version of the tsheg mark (U+0F0B) that is used exclusively between the letter NGA (U+0F44) and the shad mark (U+0F0D).
- U+0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN: The syllable "BSKA-" does not occur naturally in Tibetan, and is a mistake for "BKA-" (cf. U+0F0A). A formal alias correcting this error has been defined.
- U+11EC HANGUL JONGSEONG IEUNG-KIYEOK: U+11EC HANGUL JONGSEONG YESIEUNG-KIYEOK
- U+11ED HANGUL JONGSEONG IEUNG-SSANGKIYEOK: U+11ED HANGUL JONGSEONG YESIEUNG-SSANGKIYEOK
- U+11EE HANGUL JONGSEONG SSANGIEUNG: U+11EE HANGUL JONGSEONG SSANGYESIEUNG
- U+11EF HANGUL JONGSEONG IEUNG-KHIEUKH: U+11EF HANGUL JONGSEONG YESIEUNG-KHIEUKH
- U+156F CANADIAN SYLLABICS TTH: There is no 'tth' syllable. A better name would have been Canadian Syllabics asterisk.
- U+178E KHMER LETTER NNO: As this character belongs to the first register, its correct transliteration is nna, not NNO.
- U+179E KHMER LETTER SSO: As this character belongs to the first register, its correct transliteration is ssa, not SSO.
- U+200B ZERO WIDTH SPACE: This isn't a "space". It is an invisible character that can be used to provide line break opportunities.
- U+2113 SCRIPT SMALL L: Despite its character name, this symbol is derived from a special italicized version of the small letter "L".
- U+2118 SCRIPT CAPITAL P: Should have been called calligraphic small p or Weierstrass elliptic function symbol, which is what it is used for. It is not a capital "P" at all. A formal alias correcting this to WEIERSTRASS ELLIPTIC FUNCTION has been defined.
- U+234A APL FUNCTIONAL SYMBOL DOWN TACK UNDERBAR: named according to the Bosworth convention. Inconsistent with current APL specifications & the London convention; the names of these five symbols no longer match APL usage for up and down.
- U+234E APL FUNCTIONAL SYMBOL DOWN TACK JOT: named according to the Bosworth convention. Inconsistent with current APL specifications & the London convention; the names of these five symbols no longer match APL usage for up and down.
- U+2351 APL FUNCTIONAL SYMBOL UP TACK OVERBAR: named according to the Bosworth convention. Inconsistent with current APL specifications & the London convention; the names of these five symbols no longer match APL usage for up and down.
- U+2355 APL FUNCTIONAL SYMBOL UP TACK JOT: named according to the Bosworth convention. Inconsistent with current APL specifications & the London convention; the names of these five symbols no longer match APL usage for up and down.
- U+2361 APL FUNCTIONAL SYMBOL UP TACK DIAERESIS: named according to the Bosworth convention. Inconsistent with current APL specifications & the London convention; the names of these five symbols no longer match APL usage for up and down.
- U+2448 OCR DASH: MICR ON US SYMBOL
- U+2449 OCR CUSTOMER ACCOUNT NUMBER: MICR DASH SYMBOL
- U+2629 CROSS OF JERUSALEM: cross potent. The actual cross of Jerusalem is a cross potent with a small crosslet added at each corner.
- U+262B FARSI SYMBOL: This symbol is so named because as symbol of Iran it cannot be encoded in ISO standards.
- U+2B7A LEFTWARDS TRIANGLE-HEADED ARROW WITH DOUBLE HORIZONTAL STROKE: LEFTWARDS TRIANGLE-HEADED ARROW WITH DOUBLE VERTICAL STROKE
- U+2B7C RIGHTWARDS TRIANGLE-HEADED ARROW WITH DOUBLE HORIZONTAL STROKE: RIGHTWARDS TRIANGLE-HEADED ARROW WITH DOUBLE VERTICAL STROKE
- U+3021 HANGZHOU NUMERAL ONE: HANGZHOU is a misnomer. The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms
- U+3022 HANGZHOU NUMERAL TWO: HANGZHOU is a misnomer. The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms
- U+3023 HANGZHOU NUMERAL THREE: HANGZHOU is a misnomer. The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms
- U+3024 HANGZHOU NUMERAL FOUR: HANGZHOU is a misnomer. The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms
- U+3025 HANGZHOU NUMERAL FIVE: HANGZHOU is a misnomer. The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms
- U+3026 HANGZHOU NUMERAL SIX: HANGZHOU is a misnomer. The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms
- U+3027 HANGZHOU NUMERAL SEVEN: HANGZHOU is a misnomer. The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms
- U+3028 HANGZHOU NUMERAL EIGHT: HANGZHOU is a misnomer. The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms
- U+3029 HANGZHOU NUMERAL NINE: HANGZHOU is a misnomer. The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms
- U+327C CIRCLED KOREAN CHARACTER CHAMKO: An instance of inconsistent transliterations, resulting from irreconciled North/South Korean positions.
- U+327D CIRCLED KOREAN CHARACTER JUEUI: An instance of inconsistent transliterations, resulting from irreconciled North/South Korean positions.
- U+A015 YI SYLLABLE WU: a syllable iteration mark, not a syllable "wu"
- U+FA0E CJK COMPATIBILITY IDEOGRAPH-FA0E: unified CJK ideograph, not compatibility ideograph
- U+FA0F CJK COMPATIBILITY IDEOGRAPH-FA0F: unified CJK ideograph, not compatibility ideograph
- U+FA11 CJK COMPATIBILITY IDEOGRAPH-FA11: unified CJK ideograph, not compatibility ideograph
- U+FA13 CJK COMPATIBILITY IDEOGRAPH-FA13: unified CJK ideograph, not compatibility ideograph
- U+FA14 CJK COMPATIBILITY IDEOGRAPH-FA14: unified CJK ideograph, not compatibility ideograph
- U+FA1F CJK COMPATIBILITY IDEOGRAPH-FA1F: unified CJK ideograph, not compatibility ideograph
- U+FA21 CJK COMPATIBILITY IDEOGRAPH-FA21: unified CJK ideograph, not compatibility ideograph
- U+FA23 CJK COMPATIBILITY IDEOGRAPH-FA23: unified CJK ideograph, not compatibility ideograph
- U+FA24 CJK COMPATIBILITY IDEOGRAPH-FA24: unified CJK ideograph, not compatibility ideograph
- U+FA27 CJK COMPATIBILITY IDEOGRAPH-FA27: unified CJK ideograph, not compatibility ideograph
- U+FA28 CJK COMPATIBILITY IDEOGRAPH-FA28: unified CJK ideograph, not compatibility ideograph
- U+FA29 CJK COMPATIBILITY IDEOGRAPH-FA29: unified CJK ideograph, not compatibility ideograph
- U+FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET: A spelling error: "brakcet" should be "bracket". A formal alias correcting this error has been defined.
- U+FEFF ZERO WIDTH NO-BREAK SPACE: Byte Order Mark (Naming it ZWNBSP was a mistake from the start.)
- U+122D4 CUNEIFORM SIGN SHIR TENU: CUNEIFORM SIGN NU11 TENU
- U+122D5 CUNEIFORM SIGN SHIR OVER SHIR BUR OVER BUR: CUNEIFORM SIGN NU11 OVER NU11 BUR OVER BUR
- U+1B001 HIRAGANA LETTER ARCHAIC YE: The preferred name is HENTAIGANA LETTER E-1
- U+1D0C5 BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS: U+1D0C5 BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS
- U+1D300 MONOGRAM FOR EARTH: U+1D300 MONOGRAM FOR HUMAN
- U+1D301 DIGRAM FOR HEAVENLY EARTH: U+1D301 DIGRAM FOR HEAVENLY HUMAN
- U+1D302 DIGRAM FOR HUMAN EARTH: U+1D302 DIGRAM FOR EARTHLY HUMAN
- U+1D303 DIGRAM FOR EARTHLY HEAVEN: U+1D303 DIGRAM FOR HUMANLY HEAVEN
- U+1D304 DIGRAM FOR EARTHLY HUMAN: U+1D304 DIGRAM FOR HUMANLY EARTH
- U+1D305 DIGRAM FOR EARTH: U+1D305 DIGRAM FOR HUMANLY HUMAN
--Sorry, my copy & paste does not retain the two columns they were in. 75.90.36.201 (talk) 20:06, 9 April 2018 (UTC)
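For anyone who wants to check the "formal alias" remarks in the list, here is a minimal sketch, assuming Python's standard `unicodedata` module, whose `lookup()` resolves name aliases in Python 3.3 and later. The two code points come from the list above; the helper name `describe` and the `EXAMPLES` table are only illustrative. It is meant to show that the original name stays in place while the corrected alias resolves to the same code point.

```python
import unicodedata

# Code points taken from the list above, paired with the corrected
# names that the list says were defined as formal aliases.
EXAMPLES = {
    0x2118: "WEIERSTRASS ELLIPTIC FUNCTION",
    0xFE18: "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET",
}


def describe(code_point, alias):
    """Return (formal character name, code point that the alias resolves to)."""
    formal_name = unicodedata.name(chr(code_point))
    resolved = ord(unicodedata.lookup(alias))
    return formal_name, resolved


for cp, alias in EXAMPLES.items():
    formal_name, resolved = describe(cp, alias)
    print(f"U+{cp:04X} name={formal_name!r}, alias resolves to U+{resolved:04X}")
```

Results depend on the Unicode data bundled with the Python build in use, so this is a spot check rather than an authoritative source.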
- I formatted your previous edit, then counted the number of asterisks ("*"s) in the source text. 94 seems to be the correct number. However this is on the verge of Original research. And will you track changes made to Unicode Technical Note #27? Love —LiliCharlie (talk) 20:33, 9 April 2018 (UTC)
- I understand that there are (at least) 12 code-points which represent non-existent "characters" (also known as "ghost characters"). 妛挧暃椦槞蟐袮閠駲墸壥彁 are according to https://www.dampfkraft.com/by-id/a824aa10/#A-Spectre-is-Haunting-Unicode meaningless and NOT part of any language. In addition (as of 3/12/2018) parts of the issues section have been removed which in my view amounts to vandalism. The most egregious removal is all mention of the (politically motivated) concessions Unicode Consortium made to various nations because they claimed (rather than the experts of the relevant languages) to be the authoritative source of the language. The current article white-washes this (to some extent) by implying that some of these disagreements are over "ancient" or "obsolete" language elements when in fact some of them are in current (but "unofficial") use. Also, I vote that the 94 (or 106 if the above dozen aren't included) issues should be listed in the article (as a collapsed table, sortable by code-point name or U-number). 72.16.99.93 (talk) 22:18, 3 December 2018 (UTC)
- None of this is relevant. A bunch of these are controversial; after much discussion, the Wikipedia article is at caron, not hacek. The complaint you have about the APL characters says "named according to the Bosworth convention", which is a choice, not a mistake. Even the clear errors are irrelevant; we barely mention that Byzantine music and hentaigana are supported, thus stressing about the naming of one of the characters, a name that will have little effect on users, is beneath mention. Nobody will use 10% of Unicode's characters; it's not a real problem that there are 12 characters that have no real use.
- Editing the issues section is not vandalism; it's people disagreeing with you. I'm not even sure what you're talking about; the last three months have had no changes to the issue section.
- (Please don't use xx/xx/20xx date formats; they're inherently ambiguous, as a significant number of readers will interpret them as month/day and a significant number will interpret them as day/month.)--Prosfilaes (talk) 21:46, 24 December 2018 (UTC)
Other persisting "anomalies"
The "combining class" priorities assigned to Hebrew diacritics in the early 1990s are incorrect and semi-worthless, which means that older software displays the diacritics incorrectly, while more recent software has to work around it, but apparently this is also set in stone, and nothing can be done to fix it... AnonMoos (talk) 03:06, 4 February 2019 (UTC)
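To make the effect of these combining classes concrete, here is a minimal sketch assuming Python's standard `unicodedata` module. The three Hebrew characters are chosen purely for illustration and are not taken from the comment above, and the helper name `combining_classes` is hypothetical. The sketch prints the canonical combining class of each character and shows how normalization reorders the marks according to those classes.

```python
import unicodedata

# Illustrative characters (an assumption for this sketch, not from the post).
BET = "\u05D1"     # HEBREW LETTER BET
DAGESH = "\u05BC"  # HEBREW POINT DAGESH OR MAPIQ
PATAH = "\u05B7"   # HEBREW POINT PATAH


def combining_classes(text):
    """Pair each character's U+ label with its canonical combining class."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]


# Letter, then dagesh, then vowel point.
typed = BET + DAGESH + PATAH

# Canonical ordering sorts adjacent marks by ascending combining class.
normalized = unicodedata.normalize("NFD", typed)

print(combining_classes(typed))
print(combining_classes(normalized))
```

Because each Hebrew point has its own distinct class, normalization imposes a fixed order on the marks regardless of the order in which they were typed.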
Names or glyphs? Response to Prosfilaes
Prosfilaes has reverted my replacement of code point names with glyphs, holding that "in explaining the architectures, names are more important than glyphs". I disagree. The official names play no role in the structure of Unicode. Some code points like U+0009, the tab character, do not even have official names and, of those that do, some are incorrect (see above) and others, like LATIN SMALL LETTER Q (which displays a capital letter that seemingly claims to be small) are confusing. The Unicode Standard nowhere says that anything depends on the name of a code point.
A code point with a graphic "basic type", which most of the assigned code points have, determines the general shape of its associated glyphs. The additional designation of a font makes the shape precise, and adding the point size completes the glyph specification. Code points are of interest mainly because of this association with glyphs.
In lower case, the Greek letter sigma has two code points, U+03C2 and U+03C3. The first applies when the letter occurs at the end of a word, the second when it occurs elsewhere. Why two, when it's the same letter, pronounced the same way? Only because the shape is not even roughly the same, ς for U+03C2 and σ for U+03C3. Glyphs that differ so radically can never represent the same code point. Unlike anything having to do with official names, this is a basic feature of Unicode architecture.
In contrast, the exclamation mark ' ! ' is used for the factorial function in mathematics as well as a punctuation mark ending a sentence emphatically. These are two very different uses with nothing in common but the glyph in each applicable font, yet they have the same code point, U+0021. They are not distinguished in Unicode because the distinction has no consequence for glyphs.
One cannot always use a glyph to designate a code point uniquely. The glyph ' P ' can represent U+0050 (the first letter in Prosfilaes' username and mine), U+03A1 (the Greek letter rho), or U+0420 (the Cyrillic letter er). Unique designation is usually possible, though, and—when it is—presenting glyphs as I did in the reverted text is more helpful to the average reader than is presenting the name.
Prosfilaes also complains that 𑀈, my example of a non-BMP character, looks too much like a plus sign, which is in the BMP. That hadn't occurred to me, but another non-BMP code point could certainly be used.
Peter Brown (talk) 16:55, 20 February 2019 (UTC)
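The code points mentioned in this comment can be inspected with a short sketch, assuming Python's standard `unicodedata` module; the helper name `name_or_none` is only illustrative. It reports that U+0009 has no character name, lists the three code points that share the glyph shape "P", and shows the two lower-case sigmas.

```python
import unicodedata


def name_or_none(ch):
    """Return the Unicode character name, or None when no name is assigned."""
    return unicodedata.name(ch, None)


# U+0009 (tab) is a control character without a character name.
tab_name = name_or_none("\t")

# Three code points whose typical glyphs look like "P".
lookalikes = {f"U+{cp:04X}": name_or_none(chr(cp)) for cp in (0x0050, 0x03A1, 0x0420)}

# The two lower-case sigma code points.
sigmas = {f"U+{cp:04X}": name_or_none(chr(cp)) for cp in (0x03C2, 0x03C3)}

print("U+0009:", tab_name)
for label, name in {**lookalikes, **sigmas}.items():
    print(label, name)
```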
- Unicode encodes characters, not glyphs. Identical glyphs may be used to represent different characters (as, typically, U+0041 A LATIN CAPITAL LETTER A, U+0391 Α GREEK CAPITAL LETTER ALPHA, and U+0410 А CYRILLIC CAPITAL LETTER A), and completely different glyphs may represent the same character (U+0041 A LATIN CAPITAL LETTER A may look like 𝖠, 𝒜, 𝔄, etc.).
- Specifically, typical glyphs representing the character U+00F7 ÷ DIVISION SIGN can easily be confused with U+2797 ➗ HEAVY DIVISION SIGN, U+1365 ፥ ETHIOPIC COLON or U+223B ∻ HOMOTHETIC, while the "two-dot shape" of U+11008 𑀈 BRAHMI LETTER II looks like U+A58C ꖌ VAI SYLLABLE JOO, and its "four-dot shape" resembles U+2E2C ⸬ SQUARED FOUR DOT PUNCTUATION, U+2237 ∷ PROPORTION, U+26DA ⛚ DRIVE SLOW SIGN, U+2D46 ⵆ TIFINAGH LETTER TUAREG YAKH, U+1362 ። ETHIOPIC FULL STOP, and several of the Braille patterns.
- There is no way to confidently identify an isolated character when you only see a glyph that visualises it. It is necessary to give its semantics which in most cases is reflected by its character name. Love —LiliCharlie (talk) 18:54, 20 February 2019 (UTC)
- I think this is far overbroad for the edit in question. The dispute is between "For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g., U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used, as required (e.g., U+E0001 for the character LANGUAGE TAG and U+10FFFD for the character PRIVATE USE CHARACTER-10FFFD)." and "For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g. U+00F7 for the character ÷); for code points outside the BMP, five or six digits are used, as required (e.g. U+11008 for the character 𑀈)." As I said, this is about architecture. Yes, there are confusing names for certain Unicode characters, but this is about how many digits are used to represent that Unicode character. It doesn't matter at this layer if a name is confusing or how it might map to glyphs or user-perceived characters; just that there exists a code point labeled LATIN CAPITAL LETTER X and that it is also referenced as U+0058.--Prosfilaes (talk) 21:20, 20 February 2019 (UTC)
- Overbroad, perhaps, but I do want to respond to LiliCharlie, who claimed that "completely different glyphs may represent the same character", challenging my claim that "Glyphs that differ...radically can never represent the same code point." As support, LiliCharlie writes, "U+0041 A LATIN CAPITAL LETTER A may look like 𝖠, 𝒜, 𝔄, etc." This is supportive, however, only if LiliCharlie can name fonts in which U+0041 has 𝒜 and 𝔄, respectively, as glyphs. As far as I can determine, 𝒜 has code point U+1D49C and 𝔄 has code point U+1D504. A font in which U+0041 has 𝒜 as a glyph would hardly be a sufficient challenge anyhow, as this is quite similar to A. 𝔄 admittedly differs radically, so a font in which it represents U+0041 would definitely count against my claim.
- I challenge LiliCharlie to explain why, in lower case, medial sigma (σ) and final sigma (ς) are assigned different code points while medial and final lower-case theta (θ) both have the one code point U+03B8. The obvious answer, though there may be another, is that the glyphs for lower-case sigma, in most or all applicable fonts, are very different.
- Returning to the original dispute with Prosfilaes, the choice is between
- For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g., U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used, as required (e.g., U+E0001 for the character LANGUAGE TAG and U+10FFFD for the character PRIVATE USE CHARACTER-10FFFD).
- and
- For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g. U+00F7 for the character ÷); for code points outside the BMP, five or six digits are used, as required (e.g. U+11008 for the character 𑀈).
- At least for the sake of argument, I concede Prosfilaes' point that 𑀈 looks too much like a plus sign and propose, in the final parenthesis,
- e.g. U+2395C for the character 𣥜
- The real question is which phrasing is more accommodating for the typical reader. I disagree that "It doesn't matter at this layer if a name is confusing"; confusing names will confuse, which contravenes Wikipedia's objectives. The use of capitals is off-putting, especially as the reader has not been advised earlier (or, indeed, anywhere in the article) that letters in official Unicode names have to be capitalized. There is no explanation of what a language tag is; the phrase is simply sprung on the unsuspecting reader. Likewise with "private use", a phrase appearing in a quote from Joe Becker but never explained.
- For readers already acquainted with Unicode conventions, these considerations are not relevant. Such folks, however, are not the intended audience for Wikipedia articles.
- Luthersche Fraktur was the first one I found, and a search for Fraktur fonts shows that many of them use the glyph form 𝔄.
- We're talking about code points, not characters. You're adding confusion by saying "For code points" and then saying "the character ÷". I think you're underestimating the type of reader who is reading this article, or underestimating the difficulty of the rest of the article. The fact that names are capitalized is something that you learn about Unicode by exposure, and again, for the audience, is something they'll just absorb. Anyone with any familiarity with character encoding in computers will expect that there's control characters in Unicode, like LANGUAGE TAG.
- I object to the use of 𣥜, since that implies that Chinese is outside the BMP. Hieroglyphs, or some other clearly ancient script that is completely outside the BMP, should be used, or possibly an emoji. You're giving up the ability to show a six-digit name if you insist on using characters.--Prosfilaes (talk) 01:13, 22 February 2019 (UTC)
@Peter Brown: 1. There are two major reasons why U+03C2 ς GREEK SMALL LETTER FINAL SIGMA and U+03C3 σ GREEK SMALL LETTER SIGMA were encoded separately. The first, and already sufficient, one was to ensure round-trip compatibility with encodings that had existed before Unicode, and in which the two characters were also encoded separately. And reason number two is that there are exceptions to the rule that ⟨ς⟩ is used word-finally and ⟨σ⟩ elsewhere, see Nick Nicholas's Sigma: final vs. non-final which is part of the Thesaurus Linguae Graecae project. — 2. The Fraktur smart font I most often use is UnifrakturMaguntia. Its glyph for U+0041 A LATIN CAPITAL LETTER A is, of course, similar to 𝔄. Love —LiliCharlie (talk) 10:47, 22 February 2019 (UTC)
- @Prosfilaes:
- I don't see how I'm adding confusion by saying "For code points" and then saying "the character ÷". Saying "For code points" and then saying "the character LATIN CAPITAL LETTER X" is no less guilty of confusing code points with characters. The English letter string 'LATIN CAPITAL LETTER X' is neither a code point nor a character, nor is the glyph '÷'. Both only designate characters. '÷' has the advantage that it does not presuppose any familiarity with Latin or any other well-known script. Further, any reader who is familiar with Latin would take exception to "LATIN CAPITAL LETTER W", an official Unicode name, since Latin did not have a W. Better just to refer to "the capital letter W".
- You write:
- The fact that names are capitalized is something that you learn about Unicode by exposure, and again, for the audience, is something they'll just absorb.
- This is hardly necessary. An encyclopedia is supposed to tell the reader things, not just expose them to usages. Even if this information is added to the article, though, "the character LATIN CAPITAL LETTER X" will strike the reader—strikes me, anyhow—as odd, since a letter string is not a character. Referring to "the English character X", (thereby distinguishing it from the Greek character Χ) would be much better.
- Yes, one expects control characters, but why not something with a name familiar to the typical reader like the carriage return U+000D?
- As you say, a hieroglyph would be preferable to 𣥜.
- @LiliCharlie: Point taken.
- Thousands of Wikipedia articles refer to Unicode characters by their official names in capitalized form. The reason for this is that the names are unique and normatively identify the character referred to. If we were to abandon the official Unicode character names and devise our own names (which would be original research) then there would be endless disputes about the names. You prefer to refer to "X" as "English character X" yet you must know that X is used for hundreds of other languages, so referring to "X" as an "English character" would be totally unacceptable — which is why LATIN CAPTIAL LETTER X is so much better way of referring to the character. BabelStone (talk) 21:42, 22 February 2019 (UTC)
Why is Latin "so much better" than English? Granted, the English and Latin X is also the German and Swedish X, but we need to apply some adjective—Latin, English, German, whatever—to distinguish it from the Greek Χ, which really is a different character. In en.wikipedia.org, the character can be clearly designated as the "English character X". In sv.wikipedia.org, it would be clearer to call it the "Svenska bokstaven X". Neither is "totally unacceptable".
Choosing a locution maximally clear to the expected reader is not original research. It is not research at all. Even misspelling "capital", as you did above, engenders no problem—we all know what you meant.
Peter Brown (talk) 23:36, 22 February 2019 (UTC)
- Latin, especially LATIN, is much better than English, because the English character X seems to label something English-specific, whereas Latin is more likely to be taken as referring to Latin script; even if you're not familiar with that phrase, most people should recognize Latin is the ancestor of our script and take it as generic.
- I think the question comes down to learning styles, and while I'm not sure mine is better, I do think it's more encyclopedic to separate levels and talk here about the code-point level and how you write code points, like U+0050, without trying to drag in what the code points mean here. --Prosfilaes (talk) 06:01, 23 February 2019 (UTC)
- This must be a joke. While there are letters of the English alphabet (≈Latin letters regularly used in English) and punctuation marks regularly used in English, there is nothing like an "English X", a "Commonwealth English Æ" (as in encyclopædia) or an "English full stop/English period." The ⟨X⟩ in “Xi'an is beautiful.” is neither a "Chinese Pinyin X" nor an "English X"; it's just the Latin script capital letter X that is a common element of the English, the Chinese Pinyin, the Latin, and many other writing systems. Love —LiliCharlie (talk) 13:34, 23 February 2019 (UTC)
Once again, the wording in question has read:
- For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g., U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used, as required (e.g., U+E0001 for the character LANGUAGE TAG and U+10FFFD for the character PRIVATE USE CHARACTER-10FFFD).
This violates MOS:ALLCAPS, according to which one should use capital letters for Unicode names only "when presenting tables of Unicode data, and when discussing code point names as such. Otherwise prefer unstyled, plain-English character names". The passage in question is a discussion of the designation of code points in the 'U+' format, not of code point names as such.
Adopting Prosfilaes' suggestion that a hieroglyph be used and acknowledging LiliCharlie's objection to "the English X", I am bringing the passage into accord with the MOS by replacing it with
- For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g. U+0058 for the character 'X' in English and related languages); for code points outside the BMP, five or six digits are used, as required (e.g. U+13254 for the Egyptian hieroglyph '𓉔').
Peter Brown (talk) 19:06, 24 February 2019 (UTC)
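The digit rule that the disputed sentence describes can be stated as a short sketch in Python. The function names `u_plus` and `in_bmp` are only illustrative, and the sample characters are the ones proposed at various points in this thread. The notation pads to at least four hexadecimal digits and otherwise uses as many digits as the code point requires.

```python
def u_plus(ch):
    """Format one character in U+ notation, using at least four hex digits."""
    return f"U+{ord(ch):04X}"


def in_bmp(ch):
    """True when the code point lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF


# Sample characters proposed in the discussion above.
samples = ["X", "\u00F7", "\U00013254", "\U000E0001", "\U0010FFFD"]

for ch in samples:
    plane = "BMP" if in_bmp(ch) else "outside the BMP"
    print(u_plus(ch), plane)
```

Zero-padding to a minimum width of four covers all three cases, since five-digit and six-digit values simply exceed the minimum.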
What does MOS:ALLCAPS require in Unicode § Architecture and terminology?
In full, the bullet point in MOS:ALLCAPS relevant to Unicode reads:
- The names of Unicode code points are conventionally given in small caps (tip: enter the name in all caps into the template {{sc2}}). Example: the character ⁓ (U+2053, SWUNG DASH). This is only done when presenting tables of Unicode data, and when discussing code point names as such. Otherwise prefer unstyled, plain-English character names (whether they coincide with code point names or not): the hyphen and the en dash, not the HYPHEN-MINUS and the EN DASH.
The Unicode article currently contains the text:
- For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g., U+0058 for the character LATIN CAPITAL LETTER X)
This is contrary to MOS, as the discussion contains the all-caps text LATIN CAPITAL LETTER but does not present a table and is not about code point names as such but rather about a standard way of designating code points, one that involves hexadecimal digits.
I replaced this with text that does conform to MOS. LiliCharlie has reverted it, restoring the nonconforming code. Though the associated edit summary correctly quotes MOS:ALLCAPS as saying "The names of Unicode code points are conventionally given in small caps", the convention in question is provided by the Unicode Standard and the MOS spells out a different convention to be followed in Wikipedia articles. With some clearly-specified exceptions, we are forbidden to use code-point names in the manner prescribed by the Unicode Standard. Rather, we are instructed to use plain-English character names whether they coincide with code point names or not. Editors are welcome to improve on the phrase I used, "the character 'X' in English and related languages", perhaps referring to the Latin ancestry of the character, but such emendations should still conform to the MOS.
Peter Brown (talk) 21:47, 24 February 2019 (UTC)
- That's clearly wrong, or at best confusing, since the hyphen-minus and the hyphen are two totally different things. In any case, we should not bring in plain English character names because we're not talking about plain English characters. If necessary, I'm fine with removing the names altogether; they're not needed for the example.--Prosfilaes (talk) 01:34, 25 February 2019 (UTC)
- Well, we can use plain English referring expressions, can't we, even if we don't call them "names"? And the reader would surely appreciate seeing glyphs to get some idea what we're talking about; these could be put in parentheses. How about the following?
- For code points in the Basic Multilingual Plane (BMP), four digits are used, e.g. U+00F7 for the division sign (÷); for code points outside the BMP, five or six digits are used as required, e.g. U+13254 for the Egyptian hieroglyph designating a winding wall (𓉔).
Peter Brown (talk) 19:49, 26 February 2019 (UTC)
Version 12.1: new Japanese era name (2019-05-01)
Version 12.1 adds U+32FF ㋿ SQUARE ERA NAME REIWA "to enable software to be rapidly updated to support the new Japanese era name in calendrical systems and date formatting. The new Japanese era name was officially announced on April 1, 2019, and is effective as of May 1, 2019." [1] -DePiep (talk) 22:08, 9 July 2019 (UTC)