Code page 932 (Microsoft Windows)
MIME / IANA | Windows-31J |
---|---|
Alias(es) | CP943C |
Language(s) | Japanese |
Standard | WHATWG Encoding Standard (as "Shift_JIS") |
Extends | Shift_JIS |
Microsoft Windows code page 932 (abbreviated MS932,[1][2] Windows-932[2] or ambiguously CP932), also called Windows-31J amongst other names (see § Terminology below), is Microsoft's extended variant of Shift JIS. It contains standard 7-bit ASCII codes, and Japanese characters are indicated by the high bit of the first byte being set to 1. Some code points in this page require a second byte, so characters use either 8 or 16 bits for encoding.
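The one-or-two-byte structure described above can be illustrated with a short sketch. It assumes that Python's built-in `cp932` codec is a close enough stand-in for Windows code page 932; the helper name `describe` is made up for this example.

```python
# Sketch: how many bytes a character takes under Python's "cp932" codec,
# and whether the first byte has its high bit set.

def describe(ch: str) -> tuple[bytes, bool]:
    """Return the encoded bytes and whether the first byte has bit 7 set."""
    encoded = ch.encode("cp932")
    return encoded, bool(encoded[0] & 0x80)

samples = {
    "A": "ASCII letter",
    "\uff71": "halfwidth katakana A (single byte, high bit set)",
    "\u3042": "hiragana A (two bytes)",
}

for ch, label in samples.items():
    encoded, high_bit = describe(ch)
    print(f"{label}: {encoded.hex()} ({len(encoded)} byte(s), high bit: {high_bit})")
```

Note that the halfwidth katakana take a single byte even though the high bit is set, so the high bit alone does not tell a decoder that a second byte follows.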
IBM offer the same extended double-byte codes in their code page 943 (IBM-943 or CP943),[3] which is a combination of the single-byte Code page 897 and the double-byte Code page 941.[4]
Terminology
Microsoft's Shift JIS variant is known simply as "Code page 932" on Microsoft Windows. This name is ambiguous, however, because IBM's code page 932, while also a Shift JIS variant, differs in two ways: it lacks the NEC and NEC-selected double-byte vendor extensions that are present in Microsoft's variant (although both include the IBM extensions), and it preserves the 1978 ordering of JIS X 0208.[3]
IBM's code page 943 (or "IBM-943") includes the same double-byte codes as Windows code page 932.[3] Microsoft's version corresponds closely to the encoding referred to as ibm-943_P15A-2003 (with aliases including CP943C and Windows-932)[2] in International Components for Unicode (ICU). There is also a second ICU encoding named ibm-943_P130-1999,[5] whose single-byte mappings more closely match IBM's code page definitions. (See § Single-byte character differences below for details.)
Windows code page 932 is registered with the IANA as Windows-31J.[6] The "Windows-31J" label is IANA's and not recognized by Microsoft, which has historically used "shift_jis" instead. The W3C/WHATWG encoding standard used by HTML5 matches Windows code page 932 (including the "formerly proprietary extensions from IBM and NEC"),[7] and treats the label "shift_jis" interchangeably with "windows-31j" with the intent of being "compatible with deployed content".[8]
Windows code page 932 is also called MS-Kanji,[2][9] although IANA treat MS-Kanji as an alias for standard Shift JIS.[6]
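As one concrete example of how these labels are resolved in practice, the sketch below asks Python's codec registry which codec a few of the names map to. It only shows Python's behaviour (the Python documentation is cited above for the MS-Kanji alias); other platforms resolve the same labels differently. The helper name `canonical_name` is made up for this example.

```python
import codecs

def canonical_name(label: str) -> str:
    """Return the name of the codec that Python selects for a given label."""
    return codecs.lookup(label).name

# Labels taken from the terminology discussed above.
for label in ("cp932", "ms932", "ms-kanji", "shift_jis"):
    print(f"{label!r} -> {canonical_name(label)!r}")
```

Unlike the WHATWG Encoding Standard, Python keeps `shift_jis` and `cp932` as separate codecs, so the label chosen changes how certain bytes are decoded.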
In Japanese editions of Windows, this code page is referred to as "ANSI", since it is the operating system's default 8-bit encoding, even though ANSI was not involved in its definition.
Differences from standard Shift JIS
Windows-31J is often mistaken for standard Shift JIS (as defined in JIS X 0208:1997 Appendix 1). The two are similar, but the distinction matters to programmers who want to avoid mojibake.
Double-byte character differences
In addition to the standard JIS X 0201:1997 and JIS X 0208:1997 characters, Windows-31J includes several JIS X 0208 extensions, namely "NEC special characters (Row 13), NEC selection of IBM extensions (Rows 89 to 92), and IBM extensions (Rows 115 to 119)",[6] in addition to setting some encoding space aside for end user definition.[10] This also differs from IBM-932, which does not include the NEC extensions or NEC selection.[3]
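To relate the row numbers quoted above to actual byte values, the following sketch converts a row and cell number into a Shift JIS byte pair using the usual Shift JIS arithmetic. It assumes that the same arithmetic carries on past row 94 for the extension rows, and that Python's `cp932` codec reflects Microsoft's mapping. The function name `kuten_to_bytes` is made up for this example.

```python
def kuten_to_bytes(row: int, cell: int) -> bytes:
    """Convert a 1-based (row, cell) pair to a two-byte Shift JIS sequence."""
    if not (1 <= row <= 120 and 1 <= cell <= 94):
        raise ValueError("row must be 1..120 and cell 1..94")
    # Two rows share each lead byte; lead bytes skip the range 0xA0-0xDF.
    lead = (row + 1) // 2 + (0x80 if row <= 62 else 0xC0)
    if row % 2 == 1:
        # Odd rows use trail bytes 0x40-0x9E, skipping 0x7F.
        trail = cell + 0x3F + (1 if cell >= 64 else 0)
    else:
        # Even rows use trail bytes 0x9F-0xFC.
        trail = cell + 0x9E
    return bytes([lead, trail])

# First cell of one row from each extension range named above.
for row in (13, 89, 115):
    raw = kuten_to_bytes(row, 1)
    print(f"row {row}, cell 1 -> {raw.hex()} -> {raw.decode('cp932')}")
```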
Some of these representations were subsequently used for different characters by JIS X 0213 and Shift JIS-2004. For example, compare row 89 in JIS X 0213 (beginning 硃, 硎, 硏…)[11] to row 89 as used by JIS X 0208 with IBM/NEC extensions (beginning 纊, 褜, 鍈…).[12] Consequently, Shift JIS-2004 is not compatible with Windows-31J.
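The incompatibility can be seen by decoding the same two bytes under both encodings. This is a minimal sketch that assumes Python's `cp932` and `shift_jis_2004` codecs follow the respective mappings; 0xED40 is the byte pair for row 89, cell 1.

```python
# Row 89, cell 1 in Shift JIS byte form.
raw = bytes([0xED, 0x40])

as_cp932 = raw.decode("cp932")              # NEC selection of IBM extensions
as_sjis2004 = raw.decode("shift_jis_2004")  # JIS X 0213 plane 1, row 89

print(f"cp932:          U+{ord(as_cp932):04X} {as_cp932}")
print(f"shift_jis_2004: U+{ord(as_sjis2004):04X} {as_sjis2004}")
print("same character:", as_cp932 == as_sjis2004)
```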
In addition to the above, Microsoft maps several double-byte punctuation characters to different (but visually similar) Unicode code points than standard Shift JIS does. For example, the wave dash is mapped to U+FF5E rather than U+301C,[13] a choice followed by ibm-943_P15A-2003[14] but not by ibm-943_P130-1999,[15] and the double-byte backslash is also mapped differently.[13]
Single-byte character differences
Windows-932 includes standard 7-bit ASCII mappings for single-byte sequences with the high bit set to 0. Hence, codes 0x5C and 0x7E are mapped to Unicode as U+005C REVERSE SOLIDUS (\, the backslash) and U+007E TILDE (~) respectively,[16][17][13] as they are in ASCII (ISO-646-US). This is likewise done by the W3C/WHATWG encoding standard.[18] By contrast, 0x5C is mapped to U+00A5 YEN SIGN (¥) in ISO-646-JP and consequently JIS X 0201, of which standard Shift JIS is an extension. Correspondingly, Windows-31J avoids duplicate encoding of the backslash by mapping the double byte 0x815F to U+FF3C FULLWIDTH REVERSE SOLIDUS, whereas standard Shift JIS maps it to U+005C.[13]
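These single-byte and double-byte mappings can be checked with a few lines of code. The sketch assumes Python's `cp932` codec mirrors the Windows-31J mapping described above; the variable names are only illustrative.

```python
# Single bytes 0x5C and 0x7E decode as ASCII backslash and tilde.
single = b"\x5c\x7e".decode("cp932")

# The double-byte sequence 0x815F decodes to the fullwidth backslash,
# so the two backslash forms stay distinct when re-encoded.
double = b"\x81\x5f".decode("cp932")

print([f"U+{ord(c):04X}" for c in single])
print(f"U+{ord(double):04X}")
print("\\".encode("cp932"), "\uff3c".encode("cp932"))
```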
However, 0x5C in Windows-932 is nonetheless considered a Yen sign in certain contexts.[19] For this reason, in many Japanese fonts, U+005C is displayed as a Yen symbol, which would normally be represented as U+00A5, rather than as a backslash per Unicode's suggested rendering. U+00A5 is one-way best-fit mapped onto 0x5C in Windows-932. However, code 0x5C in Windows-932 behaves as a reverse solidus (backslash) in all respects (e.g. in file paths on Windows systems) other than how it is displayed by some fonts,[19] and Microsoft's documentation for Windows-932 displays 0x5C as a backslash.[17] This mapping[16] corresponds to the encoding named "ibm-943_P15A-2003" in International Components for Unicode (ICU),[2] except for minor reordering of a few C0 control characters.
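The one-way nature of the best-fit mapping can be sketched as follows. The helper `best_fit_encode` is hypothetical: it substitutes U+005C for U+00A5 before encoding rather than relying on any codec's built-in best-fit behaviour, and it uses Python's `cp932` codec as a stand-in for Windows-932.

```python
# Hypothetical best-fit table: U+00A5 YEN SIGN -> U+005C REVERSE SOLIDUS.
BEST_FIT = {0x00A5: 0x005C}

def best_fit_encode(text: str) -> bytes:
    """Encode text as cp932 after applying the one-way yen-sign substitution."""
    return text.translate(BEST_FIT).encode("cp932")

encoded = best_fit_encode("\u00a5100")
round_tripped = encoded.decode("cp932")

print(encoded)               # the yen sign has become byte 0x5C
print(repr(round_tripped))   # decoding yields a backslash, not U+00A5
```

Because decoding 0x5C always yields U+005C, the original yen sign cannot be recovered from the encoded bytes.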
IBM-943, like IBM-932,[3] is a superset of the single-byte Code page 897,[4] which maps 0x5C to the Yen symbol (¥) and 0x7E to the overline (‾).[20] This is followed by the encoding named "ibm-943_P130-1999" in ICU.[5] Code page 897 (and therefore also IBM-943 and IBM-932) also adds single-byte box-drawing characters in place of certain C0 control characters;[20] however, these may still be treated as control characters depending on the context,[21] and they are mapped to control characters in ICU.[5]
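The single-byte difference can be illustrated with a small decoder that treats 0x5C and 0x7E in the Code page 897 manner. This is only a sketch: it borrows Python's `cp932` codec for every other byte, so it does not reproduce ibm-943_P130-1999's double-byte mappings or the box-drawing characters, and the names `LEAD_BYTES`, `CP897_OVERRIDES` and `decode_with_cp897_single_bytes` are made up for this example.

```python
# Bytes that start a two-byte sequence in Shift JIS style encodings.
LEAD_BYTES = set(range(0x81, 0xA0)) | set(range(0xE0, 0xFD))

# Single-byte differences described above: yen sign and overline.
CP897_OVERRIDES = {0x5C: "\u00a5", 0x7E: "\u203e"}

def decode_with_cp897_single_bytes(data: bytes) -> str:
    """Decode bytes, overriding only standalone 0x5C and 0x7E."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte in LEAD_BYTES and i + 1 < len(data):
            # Two-byte character: a trail byte of 0x5C must not be overridden.
            out.append(data[i:i + 2].decode("cp932"))
            i += 2
        elif byte in CP897_OVERRIDES:
            out.append(CP897_OVERRIDES[byte])
            i += 1
        else:
            out.append(data[i:i + 1].decode("cp932"))
            i += 1
    return "".join(out)

print(decode_with_cp897_single_bytes(b"C:\\~"))
print(decode_with_cp897_single_bytes(b"\x95\x5c"))
```

The second call shows why lead bytes must be tracked: the trail byte 0x5C in a two-byte character is part of that character and is left alone.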
Layout
See also
References
1. Sivonen, Henri. "Bug 27851 - Add MS932 as a label of Shift_JIS". w3.org Bug Tracker.
2. "Converter Explorer: ibm-943_P15A-2003 (alias windows-31j)". International Components for Unicode: ICU Demonstration.
3. "IBM-943 and IBM-932". IBM Knowledge Center. IBM.
4. "Code Page 943". IBM.
5. "Converter Explorer: ibm-943_P130-1999". International Components for Unicode: ICU Demonstration.
6. "Character Sets". IANA.
7. "5. Indexes (§ Index jis0208)". Encoding Standard. WHATWG.
8. "4.2. Names and labels". Encoding Standard. WHATWG.
9. "7.2.3. Standard Encodings". Python 3.6 Documentation. Python Software Foundation. Retrieved 19 September 2017.
10. Kaplan, Michael S (2007-05-26). "The PUA outside of Unicode". Sorting it all out.
11. "233: Japanese Graphic Character Set for Information Interchange, Plane 1" (PDF). IPSJ.
12. "Index jis0208 visualization". Encoding Standard. WHATWG.
13. "Ambiguities in conversion from Shift-JIS to Unicode (Non-Normative)". XML Japanese Profile. W3C.
14. "Converter Explorer: ibm-943_P15A-2003: start byte 0x81". ICU Demonstration. International Components for Unicode.
15. "Converter Explorer: ibm-943_P130-1999: start byte 0x81". ICU Demonstration. International Components for Unicode.
16. "CP932.TXT". Unicode Consortium.
17. "Lead byte NULL — Code page 932". Microsoft.
18. "12.3.1. Shift_JIS decoder". Encoding Standard. WHATWG. "If byte is an ASCII byte or 0x80, return a code point whose value is byte."
19. Kaplan, Michael S. (2005-09-17). "When is a backslash not a backslash?". Sorting it all out.
20. "CP00897.txt". IBM.
21. "Code page identifiers - CP 00897". IBM Globalization. IBM.
External links
Microsoft related
- Microsoft's Reference for Windows Code Page 932
- CP932.TXT: Mapping of Microsoft's Code Page 932 to Unicode
- ICU Code Page 943C (ibm-943_P15A-2003 alias windows-31j) demonstration