
Code page 932 (Microsoft Windows)

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by HarJIT (talk | contribs) at 14:36, 9 April 2019 (Single-byte character differences). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
Windows Code page 932
MIME / IANA: Windows-31J
Alias(es): CP943C
Language(s): Japanese
Standard: WHATWG Encoding Standard (as "Shift_JIS")
Classification: Extended ASCII,[a] Variable-width encoding, CJK encoding
Extends: Shift_JIS
  a. Not in the strictest sense of the term, as ASCII bytes can appear as trail bytes.

Microsoft Windows code page 932 (abbreviated MS932,[1][2] Windows-932[2] or ambiguously CP932[3]), also called Windows-31J amongst other names (see § Terminology below), is the Microsoft Windows code page for the Japanese language, which is an extended variant of the Shift JIS Japanese character encoding. It contains standard 7-bit ASCII codes, and Japanese characters are indicated by the high bit of the first byte being set to 1. Some code points in this page require a second byte, so characters use either 8 or 16 bits for encoding.
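As a rough illustration of the variable-width layout, the sketch below uses Python's built-in `cp932` codec (assumed here to be a close match for Windows code page 932) to encode a few characters. The helper name `encoded_hex` is made up for this example. The last sample is a two-byte character whose trail byte is 0x5C, the same value as the ASCII backslash.

```python
def encoded_hex(text: str) -> str:
    """Return the cp932 bytes of `text` as space-separated hex pairs."""
    return text.encode("cp932").hex(" ")


samples = [
    "A",        # ASCII letter: one byte, high bit clear
    "\uff71",   # half-width katakana A: one byte, high bit set
    "\u3042",   # hiragana A: two bytes
    "\u30bd",   # katakana SO: two bytes, trail byte is 0x5C
]

for ch in samples:
    print(f"U+{ord(ch):04X} -> {encoded_hex(ch)}")
```

Because a trail byte can fall in the ASCII range, code that scans the raw bytes for 0x5C or similar values has to track lead bytes first.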

IBM offer the same extended double-byte codes in their code page 943 (IBM-943 or CP943),[4] which is a combination of the single-byte Code page 897 and the double-byte Code page 941.[5]

Terminology

Microsoft's Shift JIS variant is known simply as "Code page 932" on Microsoft Windows. This name is ambiguous, however: IBM's code page 932, while also a Shift JIS variant, lacks the NEC and NEC-selected double-byte vendor extensions that are present in Microsoft's variant (although both include the IBM extensions), and it preserves the 1978 ordering of JIS X 0208.[4]

IBM's code page 943 (or "IBM-943") includes the same double-byte codes as Windows code page 932.[4] Microsoft's version corresponds closely to the encoding referred to as ibm-943_P15A-2003 (with aliases including CP943C and Windows-932)[2] in International Components for Unicode (ICU). A second ICU encoding, ibm-943_P130-1999,[6] uses different single-byte mappings that more closely match IBM's code page definitions. (See § Single-byte character differences below for details.)

Windows code page 932 is registered with the IANA as Windows-31J.[7] The "Windows-31J" label is IANA's and not recognized by Microsoft, which has historically used "shift_jis" instead.[8] The W3C/WHATWG encoding standard used by HTML5 treats the label "shift_jis" interchangeably with "windows-31j" with the intent of being "compatible with deployed content"[9] and matches Windows code page 932 (including the "formerly proprietary extensions from IBM and NEC").[10]

Windows code page 932 is also called MS_Kanji,[2][11] although IANA treat MS_Kanji as an alias for standard Shift JIS.[7]
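How a given library resolves these labels can be checked directly. The sketch below asks Python's `codecs` registry which codec a few labels resolve to; the helper name `canonical_name` is made up for this example, and the result reflects Python's alias table only, not IANA's or ICU's.

```python
import codecs


def canonical_name(label: str) -> str:
    """Return the name of the Python codec that `label` resolves to."""
    return codecs.lookup(label).name


for label in ["cp932", "ms932", "ms-kanji", "shift_jis", "sjis"]:
    print(f"{label!r} -> {canonical_name(label)!r}")
```

In Python, `ms-kanji` resolves to the `cp932` codec, whereas `sjis` resolves to the separate `shift_jis` codec.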

In Japanese editions of Windows, this code page is referred to as "ANSI", since it is the operating system's default 8-bit encoding, even though ANSI was not involved in its definition.

Differences from standard Shift JIS

Windows-31J is often mistaken for standard Shift JIS (as defined in JIS X 0208:1997 Appendix 1). The two are similar, but the distinction matters to programmers who want to avoid mojibake.

Double-byte character differences

Euler diagram comparing repertoires of JIS X 0208, JIS X 0212, JIS X 0213, Windows-31J, the Microsoft standard repertoire and Unicode.

In addition to the standard JIS X 0201:1997 and JIS X 0208:1997 characters, Windows-31J includes several JIS X 0208 extensions, namely "NEC special characters (Row 13), NEC selection of IBM extensions (Rows 89 to 92), and IBM extensions (Rows 115 to 119)",[7] in addition to setting some encoding space aside for end user definition.[12] This also differs from IBM-932, which does not include the NEC extensions or NEC selection.[4]
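The row numbers quoted above can be related to byte values with the usual Shift JIS row/cell arithmetic. The sketch below is an illustration under that assumption: the formula is the commonly used transformation rather than something stated in this article, and the function name `kuten_to_sjis` is made up for the example.

```python
def kuten_to_sjis(row: int, cell: int) -> bytes:
    """Convert a row/cell pair (each 1-94, rows up to 120) to Shift JIS bytes."""
    if not (1 <= row <= 120 and 1 <= cell <= 94):
        raise ValueError("row or cell out of range")
    # Two rows share one lead byte; lead bytes skip 0xA0-0xDF,
    # which hold the single-byte half-width katakana.
    lead = (row + 1) // 2 + (0x80 if row <= 62 else 0xC0)
    if row % 2 == 1:
        # Odd rows use trail bytes 0x40-0x9E, skipping 0x7F.
        trail = cell + 0x3F
        if trail >= 0x7F:
            trail += 1
    else:
        # Even rows use trail bytes 0x9F-0xFC.
        trail = cell + 0x9E
    return bytes([lead, trail])


for row in (13, 89, 92, 115, 119):
    print(f"row {row}: lead byte 0x{kuten_to_sjis(row, 1)[0]:02X}")
```

Under this arithmetic, row 13 uses lead byte 0x87, rows 89 to 92 use 0xED and 0xEE, and rows 115 to 119 use 0xFA to 0xFC.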

Some of these representations were subsequently used for different characters by JIS X 0213 and Shift JIS-2004. For example, compare row 89 in JIS X 0213 (beginning 硃, 硎, 硏…)[13] to row 89 as used by JIS X 0208 with IBM/NEC extensions (beginning 纊, 褜, 鍈…).[14] Consequently, Shift JIS-2004 is not compatible with Windows-31J.
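A small sketch of this incompatibility, assuming Python's built-in `cp932` and `shift_jis_2004` codecs are faithful to the respective encodings: the same two bytes (0xED 0x40, the first cell of row 89) decode to different characters.

```python
ROW_89_CELL_1 = b"\xed\x40"

# Same bytes, two different interpretations.
as_windows = ROW_89_CELL_1.decode("cp932")           # NEC selection of IBM extensions
as_sjis2004 = ROW_89_CELL_1.decode("shift_jis_2004")  # JIS X 0213 plane 1

print(f"cp932:          U+{ord(as_windows):04X} {as_windows}")
print(f"shift_jis_2004: U+{ord(as_sjis2004):04X} {as_sjis2004}")
```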

In addition to the above, Microsoft maps several double-byte punctuation characters to different (but visually similar) Unicode code points than standard Shift JIS does. For example, the wave dash is mapped to U+FF5E rather than U+301C;[15] ibm-943_P15A-2003 follows this mapping[16] but ibm-943_P130-1999 does not.[17] The double-byte backslash is also mapped differently.[15]

Single-byte character differences

Windows-932 includes standard 7-bit ASCII mappings for single-byte sequences with the high bit set to 0. Hence, codes 0x5C and 0x7E are mapped to Unicode as U+005C REVERSE SOLIDUS (\, the backslash) and U+007E TILDE (~) respectively,[18][19][15] as they are in ASCII (ISO-646-US). This is likewise done by the W3C/WHATWG encoding standard.[20] By contrast, 0x5C is mapped to U+00A5 YEN SIGN (¥) in ISO-646-JP and consequently JIS X 0201, of which standard Shift JIS is an extension. Correspondingly, Windows-31J avoids duplicate encoding of the backslash by mapping the double byte 0x815F to U+FF3C FULLWIDTH REVERSE SOLIDUS, whereas standard Shift JIS maps it to U+005C.[15]
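The Windows-932 side of these mappings can be checked with Python's `cp932` codec (assumed to follow the Microsoft mapping for these bytes):

```python
# Single bytes 0x5C and 0x7E decode as in ASCII.
single = b"\x5c\x7e".decode("cp932")

# The double-byte sequence 0x81 0x5F decodes to the fullwidth backslash,
# so the two backslash-like characters stay distinct and round-trip.
double = b"\x81\x5f".decode("cp932")

print([f"U+{ord(ch):04X}" for ch in single + double])
print(double.encode("cp932") == b"\x81\x5f")
```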

However, 0x5C in Windows-932 is nonetheless treated as a Yen sign in certain contexts.[21] For this reason, many Japanese fonts display U+005C as a Yen symbol (which would normally be represented as U+00A5) rather than as the backslash that Unicode's suggested rendering calls for. U+00A5 is one-way best-fit mapped onto 0x5C in Windows-932. Apart from how some fonts display it, code 0x5C in Windows-932 behaves as a reverse solidus (backslash) in all respects, for example in file paths on Windows systems,[21] and Microsoft's documentation for Windows-932 displays 0x5C as a backslash.[19] This mapping[18] corresponds to the encoding named "ibm-943_P15A-2003" in International Components for Unicode (ICU),[2] except for minor reordering of a few C0 control characters.
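The one-way nature of the best-fit mapping can be sketched as follows. This is an emulation written for illustration, not the Windows conversion routine: the table `BEST_FIT` and the function `encode_best_fit` are made-up names, and only the single U+00A5 case described above is modelled.

```python
# One-way best-fit table: U+00A5 YEN SIGN -> U+005C REVERSE SOLIDUS.
BEST_FIT = {0x00A5: 0x005C}


def encode_best_fit(text: str) -> bytes:
    """Apply the best-fit substitution, then encode with cp932."""
    return text.translate(BEST_FIT).encode("cp932")


encoded = encode_best_fit("\u00a5100")
decoded = encoded.decode("cp932")

# The yen sign does not survive the round trip: 0x5C decodes to U+005C.
print(encoded, [f"U+{ord(ch):04X}" for ch in decoded])
```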

IBM-943, like IBM-932,[4] is a superset of the single-byte Code page 897,[5] which maps 0x5C to the Yen symbol (¥) and 0x7E to the overline (‾).[22] The encoding named "ibm-943_P130-1999" in ICU follows these mappings.[6] Code page 897 (and therefore also IBM-943 and IBM-932) also adds single-byte box-drawing characters in place of certain C0 control characters;[22] however, these may still be treated as control characters depending on the context,[23] and they are mapped to control characters in ICU.[6]
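To make the single-byte difference concrete, the sketch below decodes bytes with Python's `cp932` codec but substitutes the Code page 897 readings for single-byte 0x5C and 0x7E. It is a rough approximation only: the function names are made up, the lead-byte range 0x81-0x9F and 0xE0-0xFC is an assumption based on the general Shift JIS layout, and neither the box-drawing characters nor any double-byte differences are modelled.

```python
SINGLE_BYTE_OVERRIDES = {0x5C: "\u00a5", 0x7E: "\u203e"}  # yen sign, overline


def is_lead_byte(b: int) -> bool:
    """Return True if `b` starts a two-byte sequence (assumed ranges)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def decode_with_ibm_single_bytes(data: bytes) -> str:
    """Decode like cp932, except single-byte 0x5C and 0x7E."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if is_lead_byte(b):
            # Two-byte character: the trail byte is never reinterpreted,
            # even when its value is 0x5C or 0x7E.
            out.append(data[i:i + 2].decode("cp932"))
            i += 2
        else:
            out.append(SINGLE_BYTE_OVERRIDES.get(b) or bytes([b]).decode("cp932"))
            i += 1
    return "".join(out)


print(decode_with_ibm_single_bytes(b"C:\x5cdir \x83\x5c \x7e"))
```

A truncated two-byte sequence at the end of the input raises `UnicodeDecodeError`, since the sketch decodes strictly.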

Layout

First byte
  0 1 2 3 4 5 6 7 8 9 A B C D E F
0
1
2   ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~
8
9
A
B ソ
C
D
E
F
Second byte
16 × 16 grid (rows 0 to F, columns 0 to F); the cells carry no glyphs, only the classifications listed below.

Non printable ASCII character
ASCII character
ASCII character, may be substituted by localized fonts
Single-byte half-width katakana
First byte of a double-byte character, used by JIS X 0208
First byte of a double-byte NEC or NEC-selected extension character
Not used as first byte, unallocated space in JIS X 0208
First byte of a double-byte IBM extension character
First byte of a double-byte IBM-designated user defined character
Not used as first byte, best-fit mapped as single byte to private use area
Second byte of a double-byte character whose first half of the JIS sequence was odd
Second byte of a double-byte character whose first half of the JIS sequence was even
Unused as second byte of a double-byte character
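The categories above can be expressed as a small classifier. The byte ranges in this sketch are not given in the list itself: they are assumptions derived from the row numbers quoted earlier in the article (row 13, rows 89 to 92, rows 115 to 119) and the usual Shift JIS row-to-byte arithmetic, and the function names are made up for the example.

```python
def classify_first_byte(b: int) -> str:
    """Classify a byte value (0-255) by its role as a first byte."""
    if not 0 <= b <= 0xFF:
        raise ValueError("byte value out of range")
    if b <= 0x1F or b == 0x7F:
        return "non-printable ASCII"
    if b <= 0x7E:
        return "ASCII"
    if 0xA1 <= b <= 0xDF:
        return "half-width katakana"
    if 0x81 <= b <= 0x84 or 0x88 <= b <= 0x9F or 0xE0 <= b <= 0xEA:
        return "JIS X 0208 lead byte"
    if b == 0x87 or 0xED <= b <= 0xEE:
        return "NEC or NEC-selected extension lead byte"
    if 0xFA <= b <= 0xFC:
        return "IBM extension lead byte"
    if 0xF0 <= b <= 0xF9:
        return "user-defined lead byte"
    if b in (0x85, 0x86, 0xEB, 0xEC, 0xEF):
        return "unallocated lead byte"
    return "not used as first byte"  # 0x80, 0xA0, 0xFD-0xFF


def classify_second_byte(b: int) -> str:
    """Classify a byte value (0-255) by its role as a second byte."""
    if 0x40 <= b <= 0x7E or 0x80 <= b <= 0x9E:
        return "trail byte, odd row"
    if 0x9F <= b <= 0xFC:
        return "trail byte, even row"
    return "unused as trail byte"


for value in (0x41, 0xB1, 0x82, 0x87, 0xF0, 0xFA, 0xFD):
    print(f"0x{value:02X}: {classify_first_byte(value)}")
```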


References

  1. Sivonen, Henri. "Bug 27851 - Add MS932 as a label of Shift_JIS". w3.org Bug Tracker.
  2. "Converter Explorer: ibm-943_P15A-2003 (alias windows-31j)". International Components for Unicode: ICU Demonstration.
  3. Aoki, Osamu. "Chapter 11. Data conversion". Debian Reference. Debian.
  4. "IBM-943 and IBM-932". IBM Knowledge Center. IBM.
  5. "Coded character set identifiers - CCSID 943". IBM Globalization. IBM. Archived from the original on 2016-03-15.
  6. "Converter Explorer: ibm-943_P130-1999". International Components for Unicode: ICU Demonstration.
  7. "Character Sets". IANA.
  8. "Encoding.WindowsCodePage Property - .NET Framework (current version)". MSDN. Microsoft.
  9. "4.2. Names and labels". Encoding Standard. WHATWG.
  10. "5. Indexes (§ Index jis0208)". Encoding Standard. WHATWG.
  11. "7.2.3. Standard Encodings". Python 3.6 Documentation. Python Software Foundation. Retrieved 19 September 2017.
  12. Kaplan, Michael S (2007-05-26). "The PUA outside of Unicode". Sorting it all out.
  13. "233: Japanese Graphic Character Set for Information Interchange, Plane 1" (PDF). IPSJ.
  14. "Index jis0208 visualization". Encoding Standard. WHATWG.
  15. "Ambiguities in conversion from Shift-JIS to Unicode (Non-Normative)". XML Japanese Profile. W3C.
  16. "Converter Explorer: ibm-943_P15A-2003: start byte 0x81". ICU Demonstration. International Components for Unicode.
  17. "Converter Explorer: ibm-943_P130-1999: start byte 0x81". ICU Demonstration. International Components for Unicode.
  18. "CP932.TXT". Unicode Consortium.
  19. "Lead byte NULL — Code page 932". Microsoft.
  20. van Kesteren, Anne. "12.3.1. Shift_JIS decoder". Encoding Standard. WHATWG. If byte is an ASCII byte or 0x80, return a code point whose value is byte.
  21. Kaplan, Michael S. (2005-09-17). "When is a backslash not a backslash?". Sorting it all out.
  22. "CP00897.txt". IBM. Archived from the original on 2019-01-12.
  23. "Code page identifiers - CP 00897". IBM Globalization. IBM. Archived from the original on 2016-03-17.