Talk:Character encoding

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Chiccodoro (talk | contribs) at 08:45, 23 April 2014 (merge: oppose). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Archived discussion

Drop 'Popular character encodings' section?

With the link to Category:Character sets added, I'm wondering whether Popular character encodings should be dropped (or shortened to the really popular ones)? As of today I wasn't bold enough to go forward. Comments? Pjacobi 12:17, 10 Jul 2004 (UTC)

I'd say no. Categories have their uses, but the only way they can order their contents is alphabetically. I'd prefer not to abolish existing collections such as the one in this article in favour of categories. -- pne 10:55, 12 Jul 2004 (UTC)

"Code page" versus "Codepage"

Codepage redirects to Character encoding, but Code page gives the page on vendor specific code pages. Am I the only one puzzled about this? Pjacobi 12:20, 10 Jul 2004 (UTC)

Somebody appears to have fixed this now: Codepage redirects to Code page. -- pne 10:57, 12 Jul 2004 (UTC)


Article and category title ambiguity

What's the difference between a "character set" and a "character encoding" and a "text encoding" ? They deal with assigning a unique integer to each character. I suspect the difference is so subtle that we might as well merge Category:Text encodings and Category:Character sets into one category. OK? --DavidCary 15:53, 18 Jun 2005 (UTC)

The separateness of article text encoding seems of dubious value to me.
For a very detailed discussion of the terminology see http://www.unicode.org/reports/tr17/
Pjacobi 21:57, 2005 Jun 18 (UTC)
I've gone ahead and merged the text encoding article's intro paragraph with this article's intro paragraph, and replaced the text encoding article with a redirect to this one. I have also nominated Category:Text encodings for deletion.
I've gone through most of the character encoding articles and have a new scheme for their categorization in mind. Category:Character sets can stay, but it will be a subcategory of a new overarching Category:Character encoding. — mjb 28 June 2005 04:32 (UTC)
Thanks Pjacobi. http://www.unicode.org/reports/tr17/ is an excellent resource. Everyone editing this article should read it! --Negrulio (talk) 12:06, 1 November 2008 (UTC)[reply]

I suggest separating character encoding and character sets into two articles. Character sets need not be used in computers. As you may know, Chinese and Japanese are composed of characters (not words). A group of characters forms a character set; for example, the character sets used in primary and secondary school education. (See Kyōiku kanji, Jōyō kanji, Jinmeiyo kanji for Japanese usage, and 現代漢語常用字表, 現代漢語通用字表, 常用國字標準字體表, 次常用國字標準字體表, 常用字字形表 for Chinese usage.)

Besides, a character set is not the same thing as a character encoding, because one character set can be encoded in several ways. For example, Latin letters can be encoded in ASCII or in EBCDIC, and JIS X 0208 (a Japanese kanji set) can be encoded in EUC-JP or in Shift_JIS. --Hello World! 02:57, 27 September 2005 (UTC)
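The point about one character set having several encodings can be illustrated with a minimal Python sketch. It assumes the codec names `ascii`, `cp037` (picked here as one EBCDIC variant), `euc_jp` and `shift_jis` that ship with CPython; the variable names are illustrative only.

```python
# One repertoire, several encodings: the same character maps to
# different byte sequences depending on the encoding chosen.

latin_a = "A"
ascii_bytes = latin_a.encode("ascii")    # ASCII
ebcdic_bytes = latin_a.encode("cp037")   # cp037: one EBCDIC variant (assumed for illustration)

kanji = "漢"                              # a character in the JIS X 0208 repertoire
euc_bytes = kanji.encode("euc_jp")       # EUC-JP byte sequence
sjis_bytes = kanji.encode("shift_jis")   # Shift_JIS byte sequence

print(ascii_bytes.hex(), ebcdic_bytes.hex())
print(euc_bytes.hex(), sjis_bytes.hex())

# Decoding each byte sequence with its own codec recovers the same character.
same_character = euc_bytes.decode("euc_jp") == sjis_bytes.decode("shift_jis") == kanji
```

The byte values differ in each pair while the decoded character stays the same, which is the distinction being argued for.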

Unicode and the ISO and IEC have standardized terminology for such things. The "character set", as in "a set of characters", that you are talking about is officially termed a character repertoire (for which there is no need for a separate article and thus no need to disambiguate it from character encoding; at most it could just be more clearly described in the character encoding article). The term character set is acknowledged only as an overloaded, much-abused, legacy term most often referring to what they now prefer to call either a coded character set (a repertoire of characters mapped to numbers) or a character map (a repertoire of characters mapped to specific byte sequences), or occasionally a character encoding scheme (a map or method of converting a character encoding form (don't ask) to specific byte sequences). For more info, Unicode Technical Report #17 is a good reference.
Also, I asked elsewhere about how the term "character" is used in the study of written languages, as opposed to in computing, and it turns out that it's actually used to describe only certain kinds of graphemes used by certain written languages (a subset of Chinese logograms, IIRC). So your examples of other possible definitions of "character set" are in error. I think it's best to be very careful about preserving the distinction between a grapheme, the type of grapheme that is a 'character' according to scholars of written language, and the arbitrary abstraction that is a character (computing). "Character set", "character encoding" and other terms derived from the latter should be kept within the domain of computing related articles. — mjb 06:12, 27 September 2005 (UTC)[reply]
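As a rough illustration of the layers named above (coded character set, character encoding form, character encoding scheme), here is a small Python sketch. It uses UTF-16 as the example form, and the choice of character and the variable names are assumptions made for the example, not anything taken from TR17.

```python
import struct

ch = "\U0001D11E"  # U+1D11E MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane

# Coded character set: character -> code point (an integer).
code_point = ord(ch)

# Character encoding form: code point -> code units (16-bit units for UTF-16).
# A character beyond U+FFFF needs two code units (a surrogate pair).
code_units = list(struct.unpack(">2H", ch.encode("utf-16-be")))

# Character encoding scheme: code units -> bytes, which fixes the byte order.
bytes_be = ch.encode("utf-16-be")
bytes_le = ch.encode("utf-16-le")

print(hex(code_point), [hex(u) for u in code_units])
print(bytes_be.hex(), bytes_le.hex())
```

The code units are identical for both schemes; only the serialized byte order changes.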

Which article in English wikipedia talks about character repertoire? --Hello World! 14:47, 3 October 2005 (UTC)[reply]

It should be in this one and in the Unicode article. Clearly, there's work to be done :) mjb 18:07, 3 October 2005 (UTC)

thinking of a rearrangement

This article seems to be written as if the primary meaning of character encoding is "coded character set" whereas it seems to be far more often used to mean "complete process of encoding characters into a stream of code units". Plugwash 12:15, 16 January 2006 (UTC)[reply]

Braille 'the world's first binary character encoding'?

There is a discussion going on in Talk:Braille on the history of (binary) character encodings. This article is a far better place for the history (and the associated discussion). I started a section on history. This should be expanded. -- Petri Krohn 00:26, 22 April 2006 (UTC)

Are I Ching, geomantic figures and Braille "character encodings"?

According to the definition given in this article, I Ching and geomantic figures aren't character encodings. They don't represent a sequence of characters, but symbolize crucial philosophical concepts; they don't aim to facilitate computer storage or telecommunication, but divination. According to that latter argument, Braille isn't a character encoding either (it's just a plain code). Therefore, I'm intending to remove the new history section. ― j. 'mach' wust | 19:47, 22 April 2006 (UTC)[reply]

Usage

Where's information about the usage of Unicode in Wiki articles?

Simon de Danser 14:08, 13 January 2007 (UTC)[reply]

ISO-8859-16

I just added ISO-8859-16 to the list of ISO character sets but it was reverted by the anti-vandal bot. How stupid... Maybe someone will know why it was classified as vandalism and how to add it. 01:04, 2 April 2007 (UTC)

Byte order mark

Some mention should be made of the BOM (byte order mark) in files with various encodings. The BOM is described here. SharkD (talk) 02:07, 25 January 2008 (UTC)[reply]
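For anyone wanting to see what a BOM looks like in practice, a minimal Python sketch follows. It relies on the standard `codecs` module constants and the `utf-8-sig` codec; the sample text and variable names are made up for the example.

```python
import codecs

text = "abc"

# "utf-8-sig" prepends the UTF-8 BOM (EF BB BF); plain "utf-8" does not.
with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")
has_bom = with_bom.startswith(codecs.BOM_UTF8)

# Decoding with "utf-8-sig" strips the BOM; plain "utf-8" keeps it as U+FEFF.
stripped = with_bom.decode("utf-8-sig")
kept = with_bom.decode("utf-8")

# The generic "utf-16" codec writes a BOM so a reader can detect the byte order.
utf16 = text.encode("utf-16")
utf16_has_bom = utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

print(with_bom.hex(), without_bom.hex(), utf16.hex())
```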

History

I'd be interested in seeing a proper history of character encodings. What came before ASCII? What was the first character encoding used on a computer?

-- TimNelson (talk) 07:18, 10 May 2008 (UTC)[reply]

encoding in programming language?

If I understand the article correctly (bravo for the clear explanation of meaningful distinctions in section "unicode..."), a programming language adds a level of encoding on top of the whole mess: how characters are encoded internally, meaning what a character/string object actually is inside. This will indeed affect how they are manipulated and how easily a given operation is performed, both for the language itself and for the programmer if the interface is not transparent (a dream in Python, even Python 3).

We could call this a PL-specific "character (or text) format" to avoid ambiguity with the previous 4 concepts: character~grapheme, ordinal ("code point" in unicode), code unit(s) (abstract byte or word values), (concrete) bytestring.

I guess that in Python the default representation is close, if not equal, to UTF-8. But people told me one can build Python with an alternate string format, closer to strings of Unicode ordinals; I guess it is in fact similar or equal to UTF-32/UCS-4. I also read somewhere that common C implementations use a 32-bit representation of chars.

Note that it's rather complicated because the texts to be represented (by string data in memory at runtime) come from:

  • literal strings in source code
  • various forms of computations which result in strings
  • user direct input
  • files in local file system, files over all kinds of networks
  • other...

How does a language, how does a programmer, guess the original encoding (concrete scheme)? A real mystery for me...

What about a section on this topic? I searched for info in WP and couldn't find anything. Pointers (to WP or elsewhere)? Well, after some reflection, I guess this would be worth a separate article, given how complicated the topic is. But it would certainly be hard to find references to point to, and to keep the content from being so-called "original research". Except maybe for some (unreadable for humans) docs & (hardly usable) tools provided by the unicode technocracy itself.

--86.205.134.247 (talk) 08:01, 31 October 2009 (UTC)[reply]

You would have to read the documentation for the specific language you are interested in. Most older languages (e.g. C) are encoding agnostic: as far as they are concerned, strings are sequences of bytes, and beyond assigning special meaning to a few values and sequences from the ASCII range (which are therefore common to all encodings in wide use today) they don't care what the bytes mean. (C did later gain support for widechar strings, but I'm not sure if and how the operation of widechar constants is standardised.) Java uses UTF-16 strings, and there is an option on the compiler command line to tell it the encoding of source files. I've never used Python so I can't comment on the situation there. Plugwash (talk) 01:46, 3 November 2009 (UTC)
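On the question of how a program can "guess" the original encoding: a byte string carries no label, so the same bytes decode to different text depending on the codec applied. A minimal Python sketch, assuming the stock `latin-1`, `cp437` and `utf-8` codecs and a made-up sample value:

```python
# The bytes alone do not say which encoding produced them.
raw = b"caf\xe9"

as_latin1 = raw.decode("latin-1")   # 0xE9 is 'é' in ISO-8859-1
as_cp437 = raw.decode("cp437")      # the same byte is a different character in code page 437

# Strict UTF-8 decoding rejects the sequence, because 0xE9 starts a
# multi-byte sequence that is never completed.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(as_latin1, as_cp437, valid_utf8)
```

In practice the encoding has to come from outside the bytes (a declaration, a protocol header, a BOM, or a heuristic), which is why the per-language documentation matters.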

character encoding and text encoding

The redirection of 'text encoding' to this article is misleading and, from the point of view of a specialist in digital humanities, just wrong. 'Character encoding' refers to the encoding of single characters as part of some character stream. 'Text encoding', on the other hand, refers to encoding schemes that allow specific features of a text to be marked up, such as structural divisions, layout information, linguistic analysis etc. (see markup language). It would probably be best to redirect text encoding to markup language as long as there is no dedicated article on this topic. —Preceding unsigned comment added by 80.128.52.133 (talk) 10:49, 3 July 2010 (UTC)

Other names?

Are "codeset", "code set", "character coding" possible synonyms? —Preceding unsigned comment added by 193.144.63.50 (talk) 14:57, 2 December 2010 (UTC)[reply]

Recent rewrite of intro

CecilWard's recent rewrite of the lead redefined codes as being the same thing as "numbers". That's not correct. The pre-UCS/Unicode encodings generally mapped characters to bit or byte sequences. Morse Code, for example, maps them directly to electrical pulses. There are no numbers involved whatsoever. The old lead was very stable and, I believe, correct. If it is too technical or confusing to a newbie, we can work on that, but for now I'm reverting the change and am inviting discussion: what's confusing about its current form? —mjb (talk) 01:21, 18 August 2011 (UTC)[reply]

There is a code link, and it is the right thing. I see no need to change something here; it would be better to refine the "code" article to make a better overview of various ways of encoding. Incnis Mrsi (talk) 17:06, 19 August 2011 (UTC)[reply]

Proposed merge from Special characters

There's a pair of tags proposing merger of the article Special characters to this article. This would be a place to discuss this merge.--Wtshymanski (talk) 21:30, 1 February 2012 (UTC)[reply]

Merge of Code page

  • oppose No reason given for merge. No reason for merge. Two big scope articles, plenty in each. Code pages are also an obvious sub-topic within the broader topic of encoding. Why on earth would we do this? What do we gain? Andy Dingley (talk) 17:13, 27 February 2014 (UTC)
  • oppose "Code page" is not synonymous with "character encoding" as the lede to the Code page article originally stated. Code pages are an aspect of character encoding, but are not equivalent to the broader concept of character encoding, and code pages should be discussed in a separate article. BabelStone (talk) 19:15, 27 February 2014 (UTC)[reply]
  • Oppose: Just as already noted above; also, code pages are simply "excerpts" from complete character encodings, and they were created as workarounds before Unicode was available etc. Thus, merging these two articles wouldn't make much sense. — Dsimic (talk | contribs) 19:29, 27 February 2014 (UTC)
  • oppose The two articles are related, but I could not find any overlap in the contents. This article is about character encoding in general. The code page article speaks about code page numbers, their origin, and their relation to the encoding. In my humble opinion, they are perfectly separated, as they should be. If the code page article were merged into the encoding article there would not be any redundancies to remove. The whole article would make a huge subsection of the encoding article - a perfect candidate to be extracted into a separate one :-) The only thing that comes to my mind is to add a subsection with a few sentences about code pages and a reference to the "main article". --Chiccodoro (talk) 08:45, 23 April 2014 (UTC)