Talk:Unicode in Microsoft Windows

This is an old revision of this page, as edited by Artoria2e5 (talk | contribs) at 16:29, 9 May 2018 (Yes, chcp 65001 is a thing: new section). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
WikiProject Computing: Software (Start-class, Low-importance)
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia.
This article has been rated as Start-class on Wikipedia's content assessment scale.
This article has been rated as Low-importance on the project's importance scale.
This article is supported by WikiProject Software.

Untitled

Much of the last (UTF-8) paragraph is babble. One does not require UTF-8 support from the OS when there is UTF-16 support, since the conversions between UTF-8 and UTF-16 are very simple and mechanical and do not require large tables (unlike other Unicode functionality). 88.159.79.148 (talk) 17:39, 6 February 2016 (UTC)
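To illustrate the "simple and mechanical" point above, here is a minimal Python sketch of a UTF-8 to UTF-16 conversion that uses only bit arithmetic and no lookup tables. The function name utf8_to_utf16_units is made up for this example, and the sketch assumes reasonably well-formed input: it rejects bad lead bytes, bad continuation bytes and truncated sequences, but does not check for overlong forms or encoded surrogates.

```python
def utf8_to_utf16_units(data: bytes) -> list:
    """Convert UTF-8 bytes to a list of UTF-16 code units (hypothetical helper).

    Only shifts and masks are used; no conversion tables are needed.
    Overlong forms and encoded surrogates are NOT rejected in this sketch.
    """
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        # The lead byte tells us the sequence length and the first payload bits.
        if b < 0x80:
            cp, n = b, 1
        elif b >> 5 == 0b110:
            cp, n = b & 0x1F, 2
        elif b >> 4 == 0b1110:
            cp, n = b & 0x0F, 3
        elif b >> 3 == 0b11110:
            cp, n = b & 0x07, 4
        else:
            raise ValueError("invalid UTF-8 lead byte at offset %d" % i)
        if i + n > len(data):
            raise ValueError("truncated UTF-8 sequence at offset %d" % i)
        # Each continuation byte contributes six more bits.
        for c in data[i + 1:i + n]:
            if c >> 6 != 0b10:
                raise ValueError("invalid UTF-8 continuation byte")
            cp = (cp << 6) | (c & 0x3F)
        i += n
        if cp < 0x10000:
            # BMP code point: one UTF-16 code unit.
            units.append(cp)
        else:
            # Supplementary code point: split into a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
    return units
```

Calling utf8_to_utf16_units("\U00010000".encode("utf-8")) yields a surrogate pair, while ASCII and other BMP characters map to a single code unit each.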

fopen("string",...) does not work and cannot open all possible files, because no UTF-8 conversion is done. This is a violation of the POSIX and C99 standards. Windows is broken, stop trying to claim otherwise. Yes, you can work around it by converting the strings to UTF-16 and using the Windows-specific API, but it is broken in that their standard C library does not do this. Spitzak (talk) 02:15, 9 February 2016 (UTC)

Yes, chcp 65001 is a thing

Assuming you can get your hands on Windows 10, grab Ubuntu or any other WSL system from the store. Run it, and you will see that conhost reports cp65001 in the window's properties.

WSL has a binfmt_misc hook that lets the Win32 side run exe files, inheriting many of WSL's settings. One of these settings is the code page, and it causes bugs in old Python 2 versions because Python 2 does not recognize the 65001 code page that Windows reports it is using.

If you read the workarounds in the bug, you will see that chcp 850 is used to switch to an encoding that Python 2 understands, and chcp 65001 is used to switch it back afterwards. The full commands include /mnt/c/Windows/System32/cmd.exe /C, because that's how you point to cmd under WSL.

And yes, you can reproduce that without WSL. Install Python 3.6 on Windows 10, open up cmd, and you can:

C:\Python\Python36>chcp 437
Active code page: 437

C:\Python\Python36>set PYTHONLEGACYWINDOWSSTDIO=1

C:\Python\Python36>python -c print(__import__('sys').stdout.encoding)
cp437

C:\Python\Python36>chcp 65001
Active code page: 65001

C:\Python\Python36>python -c print(__import__('sys').stdout.encoding)
cp65001

PYTHONLEGACYWINDOWSSTDIO is needed to force Python to use the console's active code page because of PEP 528, which makes the Windows console streams use "utf-8" by default. Without the variable set, Python 3.6 will always report "utf-8".

--Artoria2e5 contrib 16:24, 9 May 2018 (UTC)

Regarding non-double-byte MBCSes: there is another code page in Windows with a four-byte maximum per character, cp54936 (GB 18030). Like UTF-8, it too cannot be used as the locale code page. In fact all the locale MBCS code pages are DBCS, so the likely explanation is that many programs simply cannot handle three or more bytes per character. --Artoria2e5 contrib 16:28, 9 May 2018 (UTC)
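The byte-length difference described above can be checked from Python, as a rough sketch. This uses Python's own "gbk", "gb18030" and "utf-8" codecs as stand-ins for the Windows code pages 936, 54936 and 65001; that correspondence is an assumption of the example, and the helper name encoded_length is made up here.

```python
def encoded_length(char, codec):
    """Return how many bytes `char` takes in `codec`, or None if it cannot be encoded.

    Hypothetical helper for illustration; it uses Python's codecs, not the Windows API.
    """
    try:
        return len(char.encode(codec))
    except UnicodeEncodeError:
        return None


# Sample characters: ASCII, a common CJK ideograph, and the first supplementary-plane code point.
samples = ["A", "\u4e2d", "\U00010000"]

for ch in samples:
    for codec in ("gbk", "gb18030", "utf-8"):
        print("U+%04X" % ord(ch), codec, encoded_length(ch, codec))
```

In this sketch the DBCS codec (gbk) never needs more than two bytes per character and cannot encode the supplementary-plane sample at all, whereas gb18030 and utf-8 both reach four bytes for it.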