Subtitle: Making Windows Slower Part 3: Process Destruction
In the summer of 2017 I wrestled with a Windows performance problem. Process destruction was slow, serialized, and was blocking the system input queue, leading to repeated short mouse-movement hangs when building Chrome. The root cause was that Windows was wasting a lot of time looking up GDI objects during process destruction, and it did this while holding the system-global user32 critical section. I shared the details in this blog post: 24-core CPU and I can’t move my mouse.
Microsoft fixed the bug, and I moved on with my life, but then it appeared to have come back. I heard complaints that the LLVM test suite was running slowly, with frequent input hangs.
But, in fact, the original bug was not back. It was our code that had changed.
The problem, circa 2017
Every Windows process contains several default GDI object handles. For processes that do nothing with graphics these handles will typically be NULL. During process destruction Windows called some functions on these handles – even when they were NULL. That didn’t matter – the functions were fast – until Windows 10 Anniversary Edition when some security changes made these functions slow. These functions run while holding the same lock that is used for input events, so enough processes being destroyed simultaneously, each making multiple calls to the slow function that held this critical lock, would be serialized and would block user input, leading to brief mouse-movement hangs.
Microsoft’s fix was to not call these functions for processes with no GDI objects. I never heard details but I think that Microsoft’s fix was the moral equivalent of this:
+ if (IsGUIProcess())
That is, skip doing GDI cleanup if the process isn’t a GUI/GDI process.
Since the compilers and other processes that we were rapidly creating and destroying didn’t have any GDI objects this change was enough to fix the UI hangs that we had been seeing.
The problem, circa 2018
It turns out that it is really easy to get processes that actually allocate some default GDI objects. If your process loads gdi32.dll then you automatically get default GDI objects (DCs, surfaces, regions, brushes, fonts, etc.), whether you need them or not (note that these default GDI objects don’t show up in Task Manager’s count of per-process GDI objects).
But that shouldn’t be a problem. I mean, why would a compiler be loading gdi32.dll? Well, it turns out that if you load user32.dll, shell32.dll, ole32.dll, or many other DLLs then you automatically get gdi32.dll as well (with the aforementioned default GDI objects). And accidentally loading one of these DLLs is really easy to do.
The LLVM test suite was calling CommandLineToArgvW (shell32.dll) during startup of every process, and it was occasionally calling SHGetKnownFolderPath (also shell32.dll). Having these calls was enough to pull in gdi32.dll and generate the dreaded default GDI objects. Since the LLVM test suite spawned a lot of processes it ended up being serialized during process destruction, causing huge slowdowns and input hangs, far worse than what I’d seen in 2017.
But this time we understood the underlying GDI locking issue so we immediately knew what to do.
The first fix was to avoid calling CommandLineToArgvW by manually parsing the command-line. After that change the LLVM test suite was rarely calling any functions in any of the problematic DLLs. But (and we knew this ahead of time) this made no difference to the performance. The reason it made no difference was that the remaining conditional call was sufficient to always pull in shell32.dll, which pulled in gdi32.dll, which created the default GDI objects.
The second fix was to delay load shell32.dll. Delay loading means that the DLLs are loaded on demand – when a function in them is called – instead of during process startup. This meant that shell32.dll and gdi32.dll would be loaded rarely, instead of always.
And with that, the LLVM test suite began running up to five times faster – one minute instead of five minutes. And, with no more mouse hangs our development machines were still usable while running the test suite. That is an insane speedup for some very modest changes and the author of the patches was so grateful for my investigation that they gave me a peer bonus.
Sometimes the tiniest changes have the biggest impacts. It’s all about knowing where to type zero.
The execution path not taken
It is worth repeating that we paid attention to code that was not being executed, and that has made all the difference. If you have a command-line only tool that doesn’t pull in gdi32.dll then adding code that conditionally calls a function may unconditionally increase your process-destruction costs, if the function’s presence causes gdi32.dll to be loaded. In the example below CommandLineToArgvW is never called, but without delay loading its mere presence will have performance consequences:
int main(int argc, char* argv)
if (argc < 0)
CommandLineToArgvW(nullptr, nullptr); // shell32.dll, pulls in gdi32.dll
So yes, deleting a call to a function – even if the code is never executed – can be enough to dramatically improve performance in some cases.
When I investigated the initial bug I wrote a program (ProcessCreateTests) that created 1,000 processes and then killed them all in parallel. This reproduced the hang, and when Microsoft fixed the bug I used test program to validate the fix – a video showing how can be found here. When this new variant was found I modified ProcessCreateTests to add a -user32 option and when this is specified every one of the thousand test process loads user32.dll, when otherwise they would not. As expected the time to destroy all 1,000 test processes increases dramatically when this option is specified, and mouse-movement hangs are easily observed. The time to create the processes also increases with the -user32 option, but no hangs are observed during process creation. You can use this to see just how horrible the problem can be. Some typical results from my four-core/eight-thread laptop after a week of up-time are shown here. The -user32 option increases the times for everything, but the increase in how long the UserCrit was blocked during process destruction is particularly dramatic:
Process creation took 2.448 s (2.448 ms per process).
Lock blocked for 0.008 s total, maximum was 0.001 s.
Process destruction took 0.801 s (0.801 ms per process).
Lock blocked for 0.004 s total, maximum was 0.001 s.
> ProcessCreatetests.exe -user32
Testing with 1000 descendant processes with user32.dll loaded.
Process creation took 3.154 s (3.154 ms per process).
Lock blocked for 0.032 s total, maximum was 0.007 s.
Process destruction took 2.240 s (2.240 ms per process).
Lock blocked for 1.991 s total, maximum was 0.864 s.
Digging deeper, just because
I thought of some ETW techniques that I could have used to better investigate this and started writing them up, but I ended up finding enough inexplicable behavior that I’ve spun off that research into a follow-up post. Suffice it to say that there is some evidence of more odd Windows behavior.