BrokenPipeError and the Python subprocess.run() timeout parameter having no effect on Windows

1. Discovering the problem

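  Today a Python script that had been running fine on Windows failed when moved to Linux, reporting BrokenPipeError: [Errno 32] Broken pipe. Investigation showed that the cause was the timeout parameter of subprocess.run behaving differently on Linux than on Windows.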

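import logging
import subprocess

# cmd is the shell command string built elsewhere in the script
try:
    ret = subprocess.run(cmd, shell=True, check=True, timeout=5,
                         stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
except Exception as e:
    logging.debug("Runner FAIL: %s", e)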

2. Problem description

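 To illustrate the problem, I wrote the example below. subprocess.run invokes a program that needs 10 s to finish, but the timeout is set to 1 s. In theory this code should exit after 1 s because of the timeout, but that is not what happens.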

import subprocess
import time

t = time.perf_counter()
args = 'python -c "import time; time.sleep(10)"'
try:
    p = subprocess.run(args, shell=True, check=True, timeout=1,
                       stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
except Exception as e:
    print(f"except is {e}")
print(f'coast:{time.perf_counter() - t:.8f}s')

 Test on Windows:

PS C:\Users\peng\Desktop> Get-ComputerInfo | select WindowsProductName, WindowsVersion, OsHardwareAbstractionLayer

WindowsProductName WindowsVersion OsHardwareAbstractionLayer
------------------ -------------- --------------------------
Windows 10 Pro     2009           10.0.19041.2251
PS C:\Users\peng\Desktop> python  .\test_subprocess.py
except is Command 'python -c "import time; time.sleep(10)"' timed out after 1 seconds
coast:10.03642740s
PS C:\Users\peng\Desktop>

 Test on Linux:

21:51:31 wp@PowerEdge:~/bak$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
21:56:45 wp@PowerEdge:~/bak$ python test_subprocess.py
except is Command 'python -c "import time; time.sleep(10)"' timed out after 1 seconds
coast:1.00303393s
21:57:02 wp@PowerEdge:~/bak$

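 As the output shows, the timeout parameter of subprocess.run was not honored on Windows: the timeout was reported, but only after roughly 10 s, when the child had already finished. subprocess.run executes the given command, waits for it to complete, and returns a CompletedProcess instance holding the result. The function prototype is: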

subprocess.run(args, *, stdin=None, input=None, stdout=None, stderr=None, capture_output=False, 
  shell=False, cwd=None, timeout=None, check=False, encoding=None, errors=None, text=None, env=None, universal_newlines=None)
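  • args: the command to execute. It must be a string or a sequence of string arguments.
  • stdin, stdout and stderr: the child process's standard input, output and error. Each can be subprocess.PIPE, subprocess.DEVNULL, an existing file descriptor, an already open file object, or None. subprocess.PIPE creates a new pipe to the child. subprocess.DEVNULL uses os.devnull. The default is None, which means no redirection takes place. In addition, stderr can be merged into stdout by passing stderr=subprocess.STDOUT.
  • timeout: the time limit for the command. If the command runs longer than this, the child process is killed and a TimeoutExpired exception is raised.
  • check: if set to True and the process exits with a non-zero status, a CalledProcessError exception is raised.
  • encoding: if given, stdin, stdout and stderr accept and return strings, encoded and decoded with that encoding. Otherwise they only handle bytes.
  • shell: if True, the command is executed through the operating system's shell.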

3. Problem analysis

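subprocess.run handles a TimeoutExpired exception by terminating the process and waiting on it. On POSIX, the exception object contains the partially read stdout and stderr bytes. The main reason the timeout failed on Windows in the test above is that shell mode was combined with pipes: the pipe handles may be inherited by one or more descendant processes (for example via shell=True). When the timeout fires, the shell is terminated, but the program the shell started (python in this example) keeps running and keeps the pipe open, so subprocess.run cannot return until every process using the pipe has exited. With shell=False, Windows also gives the expected result of about 1 s: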

python  .\test_subprocess.py
except is Command 'python -c "import time; time.sleep(10)"' timed out after 1 seconds
coast:1.00460970s
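
A variant of the test script with shell=False might look like the sketch below. Two things here are assumptions rather than part of the original script: the command is passed as an argument list, and sys.executable stands in for "python". It also assumes that sys.executable is the real interpreter and not a launcher that starts the interpreter as a further child process, since such a child could inherit the pipe in the same way.

```python
import subprocess
import sys
import time

start = time.perf_counter()
# With an argument list and shell=False, the sleeping interpreter is the direct
# child, so killing it on timeout removes the only other writer of the pipe.
args = [sys.executable, "-c", "import time; time.sleep(10)"]
try:
    subprocess.run(args, shell=False, check=True, timeout=1,
                   stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    timed_out = False
except subprocess.TimeoutExpired as exc:
    timed_out = True
    print(f"timed out after {exc.timeout} s")
elapsed = time.perf_counter() - start
print(f"elapsed: {elapsed:.2f}s")
```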

  This can be regarded as a defect in the Windows implementation. For details, see:
https://github.com/python/cpython/issues/87512
[subprocess] run() sometimes ignores timeout in Windows #87512


subprocess.run() handles TimeoutExpired by terminating the process and waiting on it. On POSIX, the exception object contains the partially read stdout and stderr bytes. For example:

cmd = 'echo spam; echo eggs >&2; sleep 2'
try: p = subprocess.run(cmd, shell=True, capture_output=True,
                        text=True, timeout=1)
except subprocess.TimeoutExpired as e: ex = e
 
>>> ex.stdout, ex.stderr
(b'spam\n', b'eggs\n')

 On Windows, subprocess.run() has to finish reading output with a second communicate() call, after which it manually sets the exception’s stdout and stderr attributes.
This poses the problem that the second communicate() call may block indefinitely, even though the child process has terminated.
 The primary issue is that the pipe handles may be inherited by one or more descendant processes (e.g. via shell=True), which are all regarded as potential writers that keep the pipe from closing. Reading from an open pipe that’s empty will block until data becomes available. This is generally desirable for efficiency, compared to polling in a loop. But in this case, the downside is that run() in Windows will effectively ignore the given timeout.
 Another problem is that _communicate() writes the input to stdin on the calling thread with a single write() call. If the input exceeds the pipe capacity (4 KiB by default — but a pipesize ‘suggested’ size could be supported), the write will block until the child process reads the excess data. This could block indefinitely, which will effectively ignore a given timeout. The POSIX implementation, in contrast, correctly handles a timeout in this case.
 Also, Popen.__exit__() closes the stdout, stderr, and stdin files without regard to the _communicate() worker threads. This may seem innocuous, but if a worker thread is blocked on synchronous I/O with one of these files, WinAPI CloseHandle() will also block if it’s closing the last handle for the file in the current process. (In this case, the kernel I/O manager has a close procedure that waits to acquire the file for the current thread before performing various housekeeping operations, primarily in the filesystem, such as clearing byte-range locks set by the current process.) A blocked close() is easy to demonstrate. For example:

args = 'python -c "import time; time.sleep(99)"'
p = subprocess.Popen(args, shell=True, stdout=subprocess.PIPE)
try: p.communicate(timeout=1)
except: pass

p.kill() # terminates the shell process -- not python.exe
with p: pass # stdout.close() blocks until python.exe exits

The Windows implementation of Popen._communicate() could be redesigned as follows:

  • read in chunks, with a size from 1 byte up to the maximum available,
    as determined by _winapi.PeekNamedPipe()
  • write to the child’s stdin on a separate thread
  • after communicate() has started, ensure that synchronous I/O in worker
    threads has been canceled via CancelSynchronousIo() before closing
    the pipes.

The _winapi module would need to wrap OpenThread() and CancelSynchronousIo(), plus define the THREAD_TERMINATE (0x0001) access right.
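
The second bullet above, writing to the child's stdin on a separate thread, can be illustrated in portable Python. This is only a sketch of the idea, not the redesign proposed in the issue. The feed_stdin helper and the 1 MiB payload are invented for the example, and the read on the calling thread is still a plain blocking read, so a full solution would need to bound that as well.

```python
import subprocess
import sys
import threading

# Child that reads all of stdin and reports how many bytes it received.
child_code = "import sys; print(len(sys.stdin.buffer.read()))"
payload = b"x" * (1 << 20)  # 1 MiB, far more than a default pipe buffer holds

proc = subprocess.Popen([sys.executable, "-c", child_code],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)

def feed_stdin():
    # A blocked write() only stalls this worker thread, not the caller.
    try:
        proc.stdin.write(payload)
    except OSError:
        pass  # the child exited or the pipe was closed early
    finally:
        try:
            proc.stdin.close()  # signals EOF to the child
        except OSError:
            pass

writer = threading.Thread(target=feed_stdin, daemon=True)
writer.start()

reported = int(proc.stdout.read())  # the caller is free to read while the writer works
writer.join(timeout=5)              # bounded wait on the writer thread
proc.wait(timeout=5)
proc.stdout.close()
print(reported)
```

Because the write happens on a daemon thread and the join has a timeout, a child that never reads its input would leave the worker blocked but would not hang the caller at the join.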

With the proposed changes, subprocess.run() would no longer special case TimeoutExpired on Windows.
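
When shell=True cannot be avoided, one possible workaround is to kill the whole process tree on timeout, so that no descendant keeps the pipe open. The sketch below is my own assumption of how that could look and is not taken from the issue. run_shell_with_tree_timeout is a hypothetical helper: on Windows it relies on the taskkill command with /T (include child processes) and /F (force), and on POSIX it starts the shell in a new session and signals the process group.

```python
import os
import signal
import subprocess
import sys
import time

def run_shell_with_tree_timeout(cmd, timeout):
    """Run cmd through the shell; on timeout, kill the shell and its descendants.

    Hypothetical helper, not part of the subprocess module.
    Returns (returncode, output_bytes, timed_out).
    """
    # On POSIX, a new session makes the shell the leader of its own process group.
    extra = {} if os.name == "nt" else {"start_new_session": True}
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, **extra)
    try:
        out, _ = proc.communicate(timeout=timeout)
        return proc.returncode, out, False
    except subprocess.TimeoutExpired:
        if os.name == "nt":
            # /T also terminates child processes, /F forces termination.
            subprocess.run(["taskkill", "/T", "/F", "/PID", str(proc.pid)],
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        else:
            # The group id equals the shell's pid because of start_new_session.
            os.killpg(proc.pid, signal.SIGKILL)
        # With every writer of the pipe gone, this second call can finish.
        out, _ = proc.communicate()
        return proc.returncode, out, True

start = time.perf_counter()
cmd = f'"{sys.executable}" -c "import time; time.sleep(10)"'
code, output, timed_out = run_shell_with_tree_timeout(cmd, timeout=1)
elapsed = time.perf_counter() - start
print(timed_out, f"{elapsed:.2f}s")
```

Processes that detach themselves from the tree (or from the process group on POSIX) would escape this cleanup, so it is a mitigation rather than a guarantee.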