Talk:Extended ASCII/Archive 1
This is an archive of past discussions about Extended ASCII. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
How are They Stored?
If you have a string of characters, how does one differentiate between two different code pages? I'd imagine that some form of escape character would be used. For example, if I'm writing using ISO-8859-1 and want to switch over to Greek, what is the internal change, on non-Unicode terminals? 66.190.72.225 07:23, 26 November 2005 (UTC)
- In general, the only way to conclusively know the encoding of a sequence of bytes is with some out of band meta-information. For example, MIME and HTTP use the Content-Type header to specify the character set and encoding.
- For your second question, it sounds like you want to mix characters from different character sets. That is difficult or impossible to do without Unicode. There are some encoding formats, such as ISO-2022 that permit mixing encodings, but Unicode is generally the best choice. 216.113.168.128 20:04, 21 December 2005 (UTC)
- Some programs, like Microsoft Internet Explorer, try to "guess" the code page based on how the text looks in each code page. 84.58.183.1 21:27, 25 April 2007 (UTC)
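As a rough illustration of the out-of-band point made in this thread: the same bytes decode to different text depending on which charset label travels with them. The sketch below uses `email.message.Message` from Python's standard library to read the `charset` parameter of a Content-Type header. The helper name `charset_from_content_type`, the header values, and the sample bytes are all made up for illustration.

```python
from email.message import Message

def charset_from_content_type(header_value, default="us-ascii"):
    """Return the charset parameter of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and falls back to `default`
    # when the header carries no charset parameter.
    return msg.get_content_charset(default)

# Two bytes with no intrinsic indication of how they are encoded.
payload = bytes([0xC4, 0xE9])

for header in ("text/plain; charset=ISO-8859-1",
               "text/plain; charset=ISO-8859-7"):
    charset = charset_from_content_type(header)
    print(charset, "->", payload.decode(charset))
```

Without the header, nothing in `payload` itself says whether the Latin or the Greek reading is the intended one, which is why guessing, as mentioned above, can only ever be a heuristic.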
ANSI or ASCII?
What is the difference between ANSI and Extended ASCII?
Is there a difference, or are they just the flip sides of the same thing? 216.99.201.109 (talk) 20:38, 30 October 2009 (UTC)
- ANSI's not a character set (it's used to refer to the ANSI escape sequences). Tedickey (talk) 21:44, 30 October 2009 (UTC)
- ‘ANSI’ as used in Microsoft Windows programming means any character encoding other than Unicode (the exact meaning depending on the codepage setting and particular language version of Windows in use). A number of APIs have both A (‘ANSI’) and W (wide, i.e. Unicode) versions of functions.
- I guess this is an example of double metonymy: ‘ANSI’ being used not to refer to just one of the organization’s standards (as with the X3.64 escape sequences mentioned by Mr Dickey) but to the whole kind of character encoding they might standardize (i.e. byte-level).
- (Note that although the original 7-bit ASCII was an ANSI standard (X3.4), it’s unlikely to be referred to as ANSI: the ASCII name is so much better known.)
- So to answer your question, they’re both vague terms that can mean vaguely the same thing. --82.46.154.229 (talk) 18:59, 3 April 2011 (UTC)
- If you're going to cite Windows, a Microsoft webpage would be appropriate. Googling on "windows ansi encoding" isn't showing me anything appropriate, merely the usual uninformed comments (such as Wikipedia). TEDickey (talk) 20:47, 3 April 2011 (UTC)
- Indeed. See Code Pages (Windows) and Unicode and Windows XP [PDF], which additionally give the origin of the term from the ANSI draft that became ISO 8859-1. --82.46.154.229 (talk) 02:57, 5 April 2011 (UTC)
- That's addressing part of the comments above: this source equates "ANSI" with CP-1252, but doesn't generalize to the other code pages which are supported in Windows TEDickey (talk) 10:00, 5 April 2011 (UTC)
- FYI, Windows_ANSI_code_page#ANSI_code_page deals with this. — Preceding unsigned comment added by 86.75.160.141 (talk) 20:22, 29 October 2012 (UTC)
- I see that it talks about it, but also see that it adds comments which are not found in the given sources (seems that some editors provided their own story). Using Wikipedia instead of a reliable source isn't conducive to a discussion TEDickey (talk) 23:42, 29 October 2012 (UTC)
- You are right that Wikipedia is not a correct reference, but the above page provides this reference, http://msdn.microsoft.com/en-us/goglobal/bb964658.aspx#a, where MSDN explains why ANSI is a misnomer. 86.75.160.141 (talk) 16:08, 1 November 2012 (UTC)
520256644 identified as vandalism
Variations and extensions
Because hundreds or thousands of standards exist, with many variations and with parts shared from one to another, it is difficult to get an overview of how they all relate. The following table illustrates how ASCII, extended ASCII, and their variants have had a central influence on the evolution of technology.
Telegraphy | Telephony | Computing | Aviation | ||||
---|---|---|---|---|---|---|---|
Original Baudot code (International alphabet n°1) | |||||||
↓ | |||||||
International alphabet n°2 (IA n°2) | |||||||
Variants of EBCDIC and other character encodings | |||||||
⇟ | |||||||
ISO 646 - IRV (international reference variant) | Arinc | ||||||
↓ | ↓ | ↓ | ↓ | ↓ | |||
ISO 646 - US (United States) | |||||||
ISO 646: Other countries | |||||||
↓ | |||||||
ISO 646: Other countries | |||||||
↓ | ↓ | ↓ | ↓ | ↓ | |||
ASCII | Code page DOS (437, 850, ...) | ISO 8859 series (for example ISO 8859-1, ISO 8859-15) | ISO 2022 (supports more than 256 characters) | ↓ | |||
⇟ | ↓ | ||||||
Windows code page such as Windows-1252 (or Ansinew) | ↓ | ||||||
↓ | ↓ | ↓ | ↓ | ↓ | |||
ISO 10646 / Unicode | |||||||
↓ | |||||||
IA n°5 | GSM 03.38 (SMS) |
Legend:
Legend | |
ASCII | ASCII standard or standards very close to ASCII |
Extended ASCII | Adds characters to the ASCII ones. |
Extended ASCII | Adds characters to the ASCII ones, but ASCII bytes may represent other characters depending on context |
ASCII variants | Mostly ASCII, but with some code points representing different characters |
ASCII subset | Mostly ASCII, but with some code points reserved for national variants |
Unrelated to ASCII | Not related to ASCII |
⇟ | New encoding; the previous set of characters is not preserved |
⇣ | New encoding that provides the set of characters already available in the previous one |
⇣⇟⇓⇩⥕⥥⟱⤋⬇↡
- The Windows code pages could be considered ISO-8859-x with additional characters, rather than a re-encoding of the IBM code pages. Spitzak (talk) 02:28, 6 August 2014 (UTC)
No sources?
There are no sources given for "Extended ASCII" - one source is 404, the other two sources literally say there is no such thing as "Extended ASCII". At the moment, it seems wikipedia is the original source for this. 109.193.248.102 (talk) 23:45, 17 May 2022 (UTC)
- If you're referring to the first three sources in the article, then 1) I updated the link to the Oracle forum posting to its current location, so no more 404s (and that one had an archive link that worked), and 2) all three of the comments are part of threads (mail, forum, or USENET) that speak of "extended ASCII" but all the comments say "don't use that term". This should not be surprising, given that they're used as a reference for the claim that "Using the term "extended ASCII" on its own is sometimes criticized...".
- Given that you also proposed deleting the page, I see two questions here:
- 1) Should the term "extended ASCII" be used? On the one hand, the use of that term in the thread indicates that there are people who use it; on the other hand, the comments in the thread indicate that there are arguments against its use.
- 2) Does the concept of "character sets that encode characters as sequences of 8-bit bytes, and in which the characters in ASCII are encoded as a single 8-bit byte whose value is the code point for the character, and in which characters not in ASCII are encoded as sequences of one or more 8-bit bytes in which the first byte has the uppermost bit set" deserve a Wikipedia page?
- I don't see that a "no" answer to the first question requires a "no" answer to the second question:
- The first reference speaks of "8-bit extensions of ASCII", by which I suspect they mean "character sets that encode characters as sequences of 8-bit bytes, and in which the characters in ASCII are encoded as a single 8-bit byte whose value is the code point for the character, and in which characters not in ASCII are encoded as an 8-bit byte with the uppermost bit set", so ISO 8859/1 is an "8-bit extension of ASCII" but various Extended Unix Code (EUC) encodings, and UTF-8, aren't.
- The second reference speaks of encodings of the sort I describe, as well as of UTF-16, which uses ASCII code points to represent ASCII characters, but doesn't encode them as single 8-bit bytes.
- The third reference speaks of "many, many, many different character sets designed such that ASCII is a subset of them", saying that "These may logically be regarded as extensions to ASCII, but you can't point to any one of them and say "that's Extended ASCII"."
- so they all acknowledge existence of the concept of character sets that extend ASCII by adding new characters.
- I think the general concept is useful, and its existence is acknowledged by the three people complaining about the term "extended ASCII", so I don't think the article should be deleted; instead, the page should remain, with a new title. Guy Harris (talk) 00:41, 18 May 2022 (UTC)
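The working definition used in this comment can be checked mechanically. The following is only a sketch of that definition, using Python's built-in codecs. The function names and the particular list of codecs are assumptions chosen for illustration, not anything taken from the cited references.

```python
ASCII_BYTES = bytes(range(128))
ASCII_TEXT = ASCII_BYTES.decode("ascii")

def extends_ascii_bytewise(codec):
    """True if each ASCII character encodes to the single byte equal to its code point."""
    try:
        return ASCII_TEXT.encode(codec) == ASCII_BYTES
    except UnicodeEncodeError:
        return False

def high_bit_set_for_non_ascii(codec, sample):
    """True if every byte used to encode the (non-ASCII) sample has its top bit set."""
    return all(b >= 0x80 for b in sample.encode(codec))

# Single-byte ASCII extensions, UTF-8, UTF-16, and one EBCDIC code page.
for codec in ("latin-1", "cp437", "koi8_r", "utf-8", "utf-16-le", "cp037"):
    print(codec, extends_ascii_bytewise(codec))
```

Under this test UTF-16 fails, because it uses ASCII code points but not single bytes. That is the distinction drawn above for the second reference.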
Charset table as ASCII
I removed the following section. See below for rationale. --Pjacobi 10:51, September 5, 2005 (UTC)
- Extended ASCII Table
- Extended ASCII uses 8 digits of 1's and 0's for a total of 256 characters. The first 32 however cannot be shown as they are special control sequences and thus they cannot be printed. Also the two blanks are space characters.
IMHO this section is a bad idea for two reasons:
- Giving some specific charset, whereas the article correctly states, that there are plenty.
- Using graphics for a text table
10:51, September 5, 2005 (UTC)
- I agree on the second reason; it's just that some of the characters can't be displayed on the web, as far as I know. If anyone knows a way, a regular table would be nicer. However, on the first point, ASCII has a table, and I believe a table could be beneficial and a major point of interest; it can teach you a fair amount about computers. Or maybe a text table of 00100001 through 01111110 (! through ~ on the chart), as they are the most commonly used? Or just 0-9 and A-Z? --Shimonnyman 11:28, 5 September 2005 (UTC)
- There is an ASCII table in ASCII.
- Extended ASCII tables are in ISO 8859-1, ISO 8859-2, ISO 8859-3, ISO 8859-4, ISO 8859-5, ISO 8859-6, ISO 8859-7, ISO 8859-8, ISO 8859-9, ISO 8859-10, ISO 8859-11, ISO 8859-13, ISO 8859-14, ISO 8859-15, ISO 8859-16, Code page 437, Code page 850, Code page 858, KOI8-R, KOI8-U, TSCII, Mac-Roman encoding, Kamenicky encoding (list is incomplete).
- Pjacobi 13:19, September 5, 2005 (UTC)
Those aren't Extended ASCII binary tables; that's what I was referring to and thinking could help. But the ASCII table on the ASCII page appears to be extended ASCII (a bit incomplete). I didn't check every code, but every one I checked was extended ASCII, and they aren't the same in both because, well, obviously 7-bit and 8-bit aren't going to look identical. So maybe it belongs here; I don't know. Anyway, just observing. — Preceding unsigned comment added by Shimonnyman (talk • contribs) 23:52, 5 September 2005 (UTC)
- "A table" of "extended ASCII" would be a table showing ASCII plus a bunch of "available for use when encoding other characters" slots; it cannot show any characters other than those in ASCII, because different extensions of ASCII have different characters, and thus would have different tables. That's exactly what User:Pjacobi said. Guy Harris (talk) 00:57, 18 May 2022 (UTC)
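To make the "different extensions would have different tables" point concrete, here is a small sketch that asks several of Python's built-in codecs what a single byte value means. The choice of codecs and the helper name `describe_byte` are illustrative assumptions.

```python
CODECS = ("latin-1", "cp437", "koi8_r", "iso8859_7")

def describe_byte(value):
    """Map one byte value to the character each codec assigns to it."""
    raw = bytes([value])
    return {codec: raw.decode(codec, errors="replace") for codec in CODECS}

# A byte in the ASCII range means the same thing in every one of these codecs...
print(hex(0x41), describe_byte(0x41))
# ...but a byte in the upper half depends entirely on which extension is in use.
print(hex(0xE9), describe_byte(0xE9))
```

A single "extended ASCII table" would have to pick one of these columns, which is the objection raised at the start of this section.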
Old Layout this Article Only
It appears that this article (Extended ASCII) has the old Wikipedia layout and not the new one launched in January of 2023.
Is there a way to fix it? Ducktapeonmydesk (talk) 21:37, 26 January 2023 (UTC)
- There might have been a cached copy on some Wikimedia server, and your two edits might have caused the cached copy to be flushed. There's also a "Purge" menu item that will purge cached copies of the article; it's under "Tools" in the new skin, and in whatever drop-down list contained "Move" in the old skin. Guy Harris (talk) 22:11, 26 January 2023 (UTC)
Compatibility with UTF-8
The final sentence of this article states:
- A computer language that supports Extended ASCII can also support UTF-8 without any changes; this was a major factor in UTF-8's popularity.
Which, I suppose, is technically true. A few years back I had the unpleasant task of converting a large enterprise system from ISO 8859-1 to UTF-8 (although in practice it was Windows-1252, since those additional characters were present despite the database being declared as 8859-1). It was difficult, time-consuming, and expensive.
It is true that none of the computer languages needed to be changed. C, C++, C#, Java, Javascript, ASP, SQL, Pl-SQL, etc. needed no modifications. But every single "extended" character that took up one byte in 8859 needed two bytes in UTF-8 causing all sorts of sizing issues. The sentence above could be very misleading - I can imagine one of my managers (who has never coded anything themselves) reading this and thinking the systems are backward compatible. Perhaps if there was a cite for it we could expand or clarify the sentence. As it stands, the best thing would be just to remove it.
Mr. Swordfish (talk) 13:16, 18 September 2023 (UTC)
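The sizing issue described here can be sketched in a few lines. This is only an illustration, not a description of the system mentioned: the field width, the sample values, and the helper name `fits_after_conversion` are hypothetical.

```python
def fits_after_conversion(text, field_bytes, old="cp1252", new="utf-8"):
    """Return (old length, new length, whether the new form still fits the field)."""
    old_len = len(text.encode(old))
    new_len = len(text.encode(new))
    return old_len, new_len, new_len <= field_bytes

FIELD_BYTES = 10  # hypothetical width of a fixed-size column or char array

for value in ("Jose Perez", "José Pérez", "Preis: 5 €"):
    print(value, fits_after_conversion(value, FIELD_BYTES))
```

Each of the three values occupies exactly 10 bytes in Windows-1252. In UTF-8 the second grows to 12 bytes because each "é" takes two bytes, and so does the third, because the euro sign takes three.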
- Yes, I agree, just delete it.
- TBH, I came very close to deleting the whole section as I can see no redeeming features. For now, I've tagged it as WP:OR but unless someone does a major cleanup and sourcing job on it real soon, off with its head. --𝕁𝕄𝔽 (talk) 13:46, 18 September 2023 (UTC)
- Thanks. I'm new to this page so I didn't want to charge in and make sweeping changes without asking first. I've removed the sentence in question.
- As for the rest of the section, I don't think it adds much, and if your C or C++ code uses any fixed length character arrays there's a lot of maintenance coding to deal with single byte characters turning into multi-byte characters when converting from 8859 to UTF-8, in sharp contrast to the assertion of little extra programming effort.
- I'd say just delete the section. Mr. Swordfish (talk) 15:21, 18 September 2023 (UTC)
- That said, a short treatment of how 8859 is basically not compatible with UTF-8 might be worth including if someone wants to write it. Mr. Swordfish (talk) 15:24, 18 September 2023 (UTC)
- As you stated, computer languages did not have to change. This is a big deal, making switching to UTF-8 from extended ASCII much easier than other possible switching. Also, even at that time there was lots of software that only dealt with character strings, not individual characters, and that also needed no changes. Spitzak (talk) 00:39, 19 September 2023 (UTC)
- The first question that comes to mind is "what does it mean for a computer language to "support" extended ASCII or UTF-8?" Does it mean:
- extended-ASCII comments will not be rejected by programs that process that language?
- character string constants in the language can contain extended ASCII, and octets in the string that aren't ASCII characters will be inserted into the string as is?
- identifiers in the language can contain extended ASCII characters?
- the language's support for character strings handles strings containing extended-ASCII characters?
- Something else? Guy Harris (talk) 00:58, 19 September 2023 (UTC)
- It means that strings can contain all byte values with the high bit set, and printing the string prints the same byte with the high bit set that is in the source code. Spitzak (talk) 04:28, 19 September 2023 (UTC)
- Does it also mean that, for example, if the language offers a "convert string to lower case" operation (either as a library routine or as something defined in the language's grammar), it will properly convert strings if the encoding is known, and that other string-processing operations deal with all supported encodings, including multi-byte ones? If not, then you don't get full support for non-ASCII text for free.
- (And there's the separate question of whether the compiler, if it indicates errors in the source code with, for example, a ^ or characters pointing to the error, correctly understands that, even with a fixed-width character display, there isn't a one-to-one correspondence between octets and character positions.) Guy Harris (talk) 20:23, 19 September 2023 (UTC)
- >...switching to UTF-8 from extended ascii much easier than other possible switching...
- Could you elaborate on what you mean by "other possible switching"?
- Was anybody still using EBCDIC or any of the other proprietary character sets from the sixties by the time UTF-8 came along?
- ASCII -> UTF-8 conversion is trivial since ASCII is identical to UTF-8 as long as only ASCII characters are used. 8859 -> UTF-8 is not trivial. Or easy. Mr. Swordfish (talk) 21:37, 19 September 2023 (UTC)
- The only thing that's "easy" is code that handles "extended ASCII" in the sense of "strings are a combination of ASCII characters and arbitrary uninterpreted bytes with the 8th bit set". Once you care what those 8th-bit-set bytes represent, you're dealing with the encoding, and you have to worry about the n in ISO 8859-n, at minimum. Maybe the locale makes that work if you're not doing anything too fancy. And if you have to worry about multi-byte character encodings, dealing with the encoding gets harder, as in "going from single-byte encodings to UTF-8 isn't trivial". Guy Harris (talk) 22:46, 19 September 2023 (UTC)
- This isn't rocket science. Printf "works" in UTF-8 because it only looks for '%' characters in the string, which have the exact same byte value in both ASCII and UTF-8, and otherwise prints all the other bytes unchanged. If the thing it is printing on understands UTF-8 then UTF-8 in the printf string will be interpreted correctly. Obviously any code that actually cares about which non-ASCII characters are in use will need to be changed, but the VAST MAJORITY of code does not care and does not need to be changed! Spitzak (talk) 22:59, 19 September 2023 (UTC)
- Right. Not rocket science. All you have to do is examine every character field in your thousands of database tables, look at how many bytes are allocated, look at the several million records that use those fields and see which ones are not going to fit anymore when you convert to a multi-byte character set.
- Then, look at all the code, which might include Java, C, C++, C#, T-SQL, etc and make sure that there are no assumptions about string lengths that will blow up when the strings get longer due to multi-byte encoding.
- And if you have web forms, or some other UI that exposes textboxen of a fixed length, they might need to be updated too.
- Nothing hard, just time consuming and tedious.
- All that said, the section in question seems to be WP:OR so let's either get some cites or nuke the section. Mr. Swordfish (talk) 23:21, 19 September 2023 (UTC)
- It's not changing the encoding. The input is UTF-8 and the output is UTF-8, and it does not change size. If you are measuring the string as anything other than bytes then you have much more serious problems than dealing with encodings. Spitzak (talk) 23:39, 19 September 2023 (UTC)
- Obviously if you change the encoding you need to change all the string constants to the new encoding. However at least you don't have to change the compiler, which is the whole point of this section! And if your code is such that changing the length of a string constant will cause it to not work, well all I can say is that I am sorry about your lack of programming skills. Spitzak (talk) 23:40, 19 September 2023 (UTC)
- Instead of insulting my "lack of programming skills" you might try finding some sourcing for this section. Per Wikipedia policy, unsourced material gets removed. Mr. Swordfish (talk) 14:29, 20 September 2023 (UTC)
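The byte pass-through behaviour argued over in this thread can be sketched as follows. This is not how any real `printf` is implemented. `tiny_format` and `find_directives` are hypothetical helpers that handle only `%s`, written to show why a byte-oriented scan for `%` is unaffected by UTF-8 text around it.

```python
def find_directives(fmt: bytes):
    """Return the offsets of '%' bytes, scanning byte by byte."""
    return [i for i, b in enumerate(fmt) if b == 0x25]  # 0x25 is ASCII '%'

def tiny_format(fmt: bytes, *args: bytes) -> bytes:
    """Replace each '%s' with the next argument; copy every other byte unchanged."""
    out = bytearray()
    remaining = iter(args)
    i = 0
    while i < len(fmt):
        if fmt[i:i + 2] == b"%s":
            out += next(remaining)
            i += 2
        else:
            out.append(fmt[i])  # high-bit bytes pass through uninterpreted
            i += 1
    return bytes(out)

fmt = "Grüße, %s! Preis: 5 €".encode("utf-8")
result = tiny_format(fmt, "Zoë".encode("utf-8"))
print(find_directives(fmt))
print(result.decode("utf-8"))
```

Every byte of a multi-byte UTF-8 sequence has its high bit set, so the scan cannot mistake part of "ü" or "€" for a `%`. The sketch covers only this pass-through case. Code that counts characters, truncates to a fixed width, or changes case still has to know the encoding, which is the other side of the argument above.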