Talk:Plane (Unicode)
Turoslangos is playing games here. Neither UTC nor WG2 will accept Old Hungarian into the BMP. There isn't room, and neither is there justification for encoding it there. -- Evertype·✆ 21:04, 7 November 2008 (UTC)
Plane 16 and "20-bit limit"
Obviously, Plane 16 (100000-10FFFF) is a 21-bit entity (why they crashed through to Plane 16 with 3-13 unused seems rather inelegant here, but I'm not a Unicode expert). I can, however, decipher hexadecimal. I have no idea how to "improve" ("correct"?) this, but it needs to be done. Grndrush (talk) 17:18, 3 January 2009 (UTC)
- I was about to say much the same. Is the answer to call it a 17-plane limit and ignore the bit-question? Alternatively one could explain that the 20-bit limit is a matter of the address space defined by the available surrogate pairs, and thus defines the number of planes available beyond the BMP. (If I have understood aright…) Ian Spackman (talk) 00:11, 28 July 2009 (UTC)
- 21 bits is just an outcome; it is not the preset limit. Here we go. The BMP is defined as the full 16 bits (hhhh): 0000-FFFF, 65,536 numbers. (So the full form is 00hhhh, and Plane=0.) In this plane are defined 1024 high surrogates and 1024 low surrogates, at D800-DBFF and DC00-DFFF. Surrogates must be used in pairs (one high, one low) to point to a character. So they can identify exactly 1024x1024 = 1,048,576 (~1M) points. Together, written hhhh(high).hhhh(low), a pair takes 32 bits. So the 1M points are within the range D800.DC00 - DBFF.DFFF (but not every number in that range is a valid pair).
- In comes UTF-16. UTF-16 recalculates these 32-bit numbers 1:1 into the range 10000-10FFFFhex, starting right after plane 0 (at FFFF+1), and exactly filled with the ~1M points, creating planes 1 to 16dec (=the final 10hex). Now there is no unused number any more, and the whole range can be identified with 21 bits.
- So because there are 1024x1024 surrogate pairs defined, the UTF-16 recalculated numbers fit exactly in a 21-bit range. Starting plane 17 at 10FFFF+1=110000 would still fit in 21 bits, but it cannot be recalculated to a high-low 32-bit pair, because all 1024x1024 pairs are already used up by planes 1 to 16.
- Nowadays the U+hhhhhh notation is used commonly. -DePiep (talk) 17:13, 6 October 2010 (UTC)
- 0xHHHHHHHH 108.71.120.43 (talk) 20:50, 10 October 2016 (UTC)
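The surrogate arithmetic described in this thread can be sketched in a few lines of Python. This is only an illustration of the mapping as explained above, not text from the Standard; the function names `decode_surrogate_pair` and `encode_surrogate_pair` are hypothetical.

```python
# Surrogate ranges in the BMP, as given in the discussion above.
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # 1024 high surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF     # 1024 low surrogates


def decode_surrogate_pair(high: int, low: int) -> int:
    """Map a (high, low) surrogate pair to a code point in planes 1 to 16."""
    if not (HIGH_START <= high <= HIGH_END and LOW_START <= low <= LOW_END):
        raise ValueError("not a valid high/low surrogate pair")
    # 10 bits come from each surrogate; the result starts right after the BMP.
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


def encode_surrogate_pair(cp: int) -> tuple:
    """Inverse mapping: split a supplementary code point into (high, low)."""
    if not (0x10000 <= cp <= 0x10FFFF):
        raise ValueError("code point is not in planes 1 to 16")
    offset = cp - 0x10000
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


first = decode_surrogate_pair(0xD800, 0xDC00)   # lowest possible pair
last = decode_surrogate_pair(0xDBFF, 0xDFFF)    # highest possible pair
pair_count = (HIGH_END - HIGH_START + 1) * (LOW_END - LOW_START + 1)

print(f"{first:X}..{last:X}, {pair_count} pairs, "
      f"{last.bit_length()} bits, last plane {last >> 16}")
```

The lowest pair lands on 10000 and the highest on 10FFFF, so the 1024x1024 pairs fill planes 1 to 16 exactly, which is why the 21-bit figure is an outcome rather than a design limit.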
Typo in "Supplementary Special-purpose Plane" ??
The section "Supplementary Special-purpose Plane" includes the line:
Variation Selectors Supplement (0E0100–E01EF)
That zero in front of the first hex number sure looks wrong to me, but I honestly don't know enough about this topic to know if it serves some actual purpose. Would someone better informed please fix it if it's wrong, or say why it's right?
Private Use Area planes for social networks
I've been finding HTML documents with glyphs for Facebook, Twitter, etc. as Unicode characters in the Private Use Area planes. This requires a custom font. Any references on this? --John Nagle (talk) 20:46, 30 April 2013 (UTC)
- As the definition goes: anyone can publish or use a character definition in PUA space (example: I may have a PUA character to mail to my spouse to say X, and only we two know. We don't see the font, but the char number is enough for us to meet). If FB or TWI does so, it is up to them to provide the font, and to make it work publicly. If they can't get that right, the reader will see the wrong character. Like in the old days: question marks at best.
- Actually, is that so? Examples by FB or TWI? It could be that users/companies are using PUAs (writing on FB or TWI), but then the issue is with these users. -DePiep (talk) 21:08, 30 April 2013 (UTC)
UTF-8 "designed for 2^21 bits"
The UTF-8 coding scheme was designed when Unicode was still contemplating a 31-bit space. It was not "designed" for a limit of 2^21 codepoints, and was eventually restricted to a much smaller number anyway (0x10FFFF). Elphion (talk) 01:13, 3 October 2016 (UTC)
- Why would Unicode modernize a code space by making it smaller? 108.71.123.25 (talk) 16:05, 5 October 2016 (UTC)
- Because otherwise the parties could not agree on a standard. Too many manufacturers were already heavily invested in 16-bit characters. UTF-16 was the compromise that allowed the standard to go forward. When eventually we run out of space (and we will, though computing technology will have changed a lot by the time that happens), larger spaces will be introduced. But they will not be "Unicode". -- Elphion (talk) 16:18, 5 October 2016 (UTC)
- But 0x00E00000 to 0x00FFFFFF and 0x60000000 to 0x7FFFFFFF were assigned! And my flip phone uses such an operating system that uses a 32 bit code space. 108.71.123.25 (talk) 16:21, 5 October 2016 (UTC)
- (see below -- Elphion (talk) 16:23, 5 October 2016 (UTC))
- When I can enter text on my flip phone, a character map with a code point above it is shown. It highlights the space and displays 0x00000020 in the top. This implies that it uses a 32 bit space. 108.71.123.25 (talk) 16:27, 5 October 2016 (UTC)
- (see below -- Elphion (talk) 16:23, 5 October 2016 (UTC))
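To make the "2^21" in this section's heading concrete, here is a minimal Python sketch of sequence lengths under the original UTF-8 scheme, which covered a 31-bit space with sequences of up to six bytes. The function name `utf8_length_original` is hypothetical, and the thresholds are the well-known per-length limits of that older scheme, not values quoted in this thread.

```python
def utf8_length_original(value: int) -> int:
    """Bytes needed for `value` under the original 31-bit UTF-8 scheme."""
    if value < 0 or value > 0x7FFFFFFF:
        raise ValueError("outside the 31-bit space")
    # Exclusive upper bound for sequences of 1, 2, 3, 4 and 5 bytes.
    limits = [0x80, 0x800, 0x10000, 0x200000, 0x4000000]
    for length, limit in enumerate(limits, start=1):
        if value < limit:
            return length
    return 6


four_byte_capacity = 0x200000      # 2**21 values fit in four bytes
unicode_limit = 0x10FFFF + 1       # size of the Unicode codespace

for value in (0x20, 0x2FA1D, 0x10FFFF, 0x1FFFFF, 0x200000, 0x7FFFFFFF):
    print(f"0x{value:08X}: {utf8_length_original(value)} byte(s)")
```

Four-byte sequences have room for 2^21 values, but the cap at 0x10FFFF uses only part of that room; the five- and six-byte forms are no longer valid UTF-8.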
0x00E00000 to 0x00FFFFFF/0x60000000 to 0x7FFFFFFF
Some operating systems still have these as private use areas. 108.71.123.25 (talk) 16:07, 5 October 2016 (UTC)
- But those are not Unicode planes, the subject of this article. The Unicode standard sets a maximum of 17 planes. There is nothing to stop people from storing other values in 32 bits, but that's not Unicode. -- Elphion (talk) 16:13, 5 October 2016 (UTC)
- Universal Character Set still has this. Some operating systems still have these. My flip phone has one such operating system that uses UTF-32/UCS-4, and it shows an 8 digit code point. 108.71.123.25 (talk) 16:17, 5 October 2016 (UTC)
- No, UCS was revised to agree with Unicode, for consistency. Whatever your flip phone uses is not Unicode, and not UCS-4, no matter how it might be labeled. -- Elphion (talk) 16:21, 5 October 2016 (UTC)
- When I can enter text, it displays 0x00000021 and highlights the space. This 8 digit code point means that it is a 32 bit code space. 108.71.123.25 (talk) 16:25, 5 October 2016 (UTC)
- As I said, nothing prevents a programmer from storing arbitrary values in 32 bits. That doesn't make them Unicode, which has a very precise and well-documented definition that caps the space at U+10FFFF. The number of leading zeroes shown in the display doesn't alter that. Added: If in fact your phone uses values above U+10FFFF, it was programmed to use a non-standard extension of Unicode, which (since Unicode is capped) is reasonably safe, in the sense that those private characters will never be assigned conflicting Unicode values. But the programmer would have no expectation that the non-standard values would be understood beyond the phone's universe. Such a message sent to another phone from a different manufacturer (or a different revision level) likely won't display as intended. -- Elphion (talk) 16:57, 5 October 2016 (UTC)
- I scrolled through the characters. The map starts at 0x00000020 and ends at 0x0002FA1D. 108.71.123.25 (talk) 17:41, 5 October 2016 (UTC)
- Regardless of what encoding scheme your phone uses, your changes will be reverted because they contradict the actual Unicode Standard and that's what this article is about. See chapter 2.4 of the Standard:
In the Unicode Standard, the codespace consists of the integers from 0 to 10FFFF, comprising 1,114,112 code points available for assigning the repertoire of abstract characters.
- Anything outside of that codespace isn't Unicode and isn't relevant to this article. DRMcCreedy (talk) 18:02, 5 October 2016 (UTC)
- In the "Help" display for entering characters, it says "...to select the UTF-32/UCS-4 character..." 108.66.233.59 (talk) 18:04, 5 October 2016 (UTC)
- And, according to your own experiment, it does not go beyond the Unicode space: it stops at U+2FA1D, which is well below U+10FFFF. So although your phone is using a 32-bit display (or a 31-bit display, it's hard to tell when the highest digit is 0), it is only dealing in characters within the Unicode space. -- Elphion (talk) 18:21, 5 October 2016 (UTC)
- It's a 32 bit display. 108.66.233.59 (talk) 18:34, 5 October 2016 (UTC)
- Also, if I click this link on my flip phone, and I enter a number higher than 0x0010FFFF, it displays a box with the code point in it. For example, if I enter 0x60000000, it displays this:
- +----+
- |6000|
- |0000|
- +----+
108.71.120.43 (talk) 22:29, 10 October 2016 (UTC)
And that shows nothing, except that your cellphone and the internet app do not screen out non-standard input. Neither your cellphone nor the app at unicodelookup.com constitutes a WP:RS. As we have all been telling you, the standard is quite clear: there are no valid code points above U+10FFFF. -- Elphion (talk) 22:54, 10 October 2016 (UTC)
- If I click the same link on my Windows computer or android phone, the site says undefined which is what it is. There is in my assessment too little unused space in BMP to be able to extend UTF-16 into 6 bytes. Otherwise a new type of high surrogate could be allocated as the first of 3 16-bit words.--BIL (talk) 14:59, 8 January 2017 (UTC)
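As a small illustration of the codespace limit quoted from chapter 2.4, the following Python sketch checks the values mentioned in this thread against the range 0 to 10FFFF. The helper names `is_unicode_code_point` and `plane_of` are hypothetical.

```python
MAX_CODE_POINT = 0x10FFFF   # upper end of the codespace per the quoted text


def is_unicode_code_point(value: int) -> bool:
    """True if `value` lies inside the Unicode codespace 0..10FFFF."""
    return 0 <= value <= MAX_CODE_POINT


def plane_of(value: int) -> int:
    """Plane number (0 to 16) of a code point inside the codespace."""
    if not is_unicode_code_point(value):
        raise ValueError(f"0x{value:08X} is outside the Unicode codespace")
    return value >> 16


# Values taken from the discussion above.
for value in (0x00000020, 0x0002FA1D, 0x0010FFFF, 0x00E00000, 0x60000000):
    if is_unicode_code_point(value):
        print(f"0x{value:08X}: plane {plane_of(value)}")
    else:
        print(f"0x{value:08X}: not a Unicode code point")

print(MAX_CODE_POINT + 1, "code points in total")
```

Displaying a value with eight hex digits does not change the outcome: 0x0002FA1D is in plane 2, while 0x00E00000 and 0x60000000 fall outside the codespace.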
Math error, request confirmation/correction by Unicode standards expert.
When I tally the total number of code points available from the three Private Use ranges, I get four (4) more than is indicated by the summary at the top of this article.
- par.3: .... 137,468 are reserved for private use, leaving 974,530 for public assignment.
- par.4: .... 65,536 code points (Supplementary Private Use Area-A and -B, which constitute the entirety of planes 15 and 16).
Basic Multilingual Plane:
- par.4: As of Unicode 12.1, the BMP comprises the following 163 blocks: .... Private Use Area (E000–F8FF)
   F8FFhex   63743   end of BMP Private Use Block
  -DFFFhex   57343   end of preceding Surrogate Block
  =============
   1900hex    6400   code points in BMP Private Use Block
     6,400   Private Use Block in Unicode Plane 0 (BMP)
  + 65,536   Private Use Block in Unicode Plane 15 (PUA-A)
  + 65,536   Private Use Block in Unicode Plane 16 (PUA-B)
  ========
   137,472   tally of the three (3) Private Use Blocks
   137,472   tally of the three (3) Private Use Blocks
  -137,468   code points referenced in introduction to this article
  ========
         4   fewer code points in the intro than calculated from tallies of the 3 blocks
Tree4rest (talk) 23:46, 24 September 2019 (UTC)
- Is this possibly caused by the xxFFFE and xxFFFF code points in the PUA planes? Spitzak (talk) 23:54, 24 September 2019 (UTC)
- Yes. Although each plane has 65,536 = 2^16 code points, the last two in each plane are permanently declared non-characters. So only 65,534 are available for (any) use in planes 15 and 16. -- Elphion (talk) 00:35, 25 September 2019 (UTC)
- Actually, even if the last two code points in each plane are declared "non-characters", they are valid code points and can be encoded, say with UTF-8, even if the encoded text is non-conforming. The same is true for the few non-characters assigned inside the Arabic forms near the end of the BMP. Being "non-characters" means that they are not useful for encoding text for interchange, but they can still be used *locally* as special-purpose marks inside applications, libraries, or renderers, to facilitate their implementation. And they are used for that: on input, texts are filtered, and non-characters may be filtered out, or the whole document may be rejected as invalid, or they may be replaced by a placeholder; but internally they can be used freely by the implementation, which should still not emit transformed texts containing them, because these texts would be rejected by the recipient.
- Those non-characters then have NO meaning (like PUA), but they are more restricted than PUA because their interchange in conforming text documents is invalid (for example, the non-characters must NOT be present in documents conforming to standards like HTML, XML or JSON). And various applications or libraries will reject them if they ever detect them: for example, a filesystem API may detect an encoding error and filesystem inconsistency, or desynchronization problems, or data corruption in the media, and could refuse to mount such a filesystem or to grant any write access without specific permission. Special maintenance will then be needed, which cannot be automated, as it could cause security issues or corruption of important data that is not supposed to be text and could be an encrypted binary file: "repairing" the filesystem by replacing or dropping those characters could damage the data or invalidate its binary signature.
- Non-characters are very useful, as they can be used to detect corruption, access violations, or failures in communication or storage protocols. They can then be used as guards (notably the last two code points at the end of each plane), for example to create binary container formats multiplexing text parts and binary parts, all with variable length (e.g. inside encoded media streams in audio/video/image formats, including JPEG, MPEG, PNG, WebM, Ogg and others, where text fragments may be present for tagging metadata, subtitles, titles, or licensing and copyright statements, or to embed URIs or HTML, XML and JSON documents). There are not many non-characters, but they are still valid code points, meaning that they can be transformed bijectively between all conforming UTFs. That is not the case for surrogates, which do not have this bijective capability, so there is no roundtrip conversion for them (the roundtrip does not work with two successive surrogates; it only works with isolated surrogates, which are still forbidden in conforming texts). Surrogates do not have any value even if they have code points assigned to them; they exist only to implement UTF-16. If UTF-16 were not part of the standard, there would be NO surrogates at all in the BMP, but there would still be non-characters. verdy_p (talk) 02:59, 17 October 2020 (UTC)
- The article count is correct. The two non-characters at the end of each plane are specifically excluded from PUA-A and PUA-B per The Unicode Standard (https://www.unicode.org/versions/Unicode13.0.0/ch23.pdf#G19378). DRMcCreedy (talk) 03:43, 17 October 2020 (UTC)
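The arithmetic behind the resolution of this thread can be re-run as a short Python sketch. The private-use ranges are those discussed above, with the last two code points of planes 15 and 16 excluded; the surrogate count (2,048) and the total of 66 noncharacters are added here as assumptions about the Standard in order to reproduce the article's 974,530 figure.

```python
# Private-use ranges, excluding the two noncharacters ending planes 15 and 16.
bmp_pua = 0xF8FF - 0xE000 + 1          # Private Use Area in the BMP
pua_a = 0xFFFFD - 0xF0000 + 1          # Supplementary Private Use Area-A
pua_b = 0x10FFFD - 0x100000 + 1        # Supplementary Private Use Area-B
private_use = bmp_pua + pua_a + pua_b

# The tally from the original question counted whole planes 15 and 16.
naive_tally = bmp_pua + 2 * 0x10000

# Assumed figures: 2,048 surrogates; 66 noncharacters
# (32 at FDD0..FDEF plus the last two code points of all 17 planes).
codespace = 0x10FFFF + 1
surrogates = 0xDFFF - 0xD800 + 1
noncharacters = 32 + 2 * 17
public = codespace - surrogates - noncharacters - private_use

print(f"BMP PUA: {bmp_pua}, PUA-A: {pua_a}, PUA-B: {pua_b}")
print(f"private use: {private_use} (naive tally: {naive_tally})")
print(f"left for public assignment: {public}")
```

The difference of four between the two tallies is exactly the xxFFFE and xxFFFF code points of planes 15 and 16.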