Yes, we really did achieve a 3.7X speedup on a garbage collection benchmark by removing a call to sleep(). The patch generated lots of attention on reddit and Hacker News, and I’d like to take the opportunity to explain a little more about it, and about the WebKit project.
How we approach performance
The WebKit project practices benchmark-driven development: Select a benchmark, profile it, optimize, and then measure to verify an improvement. Focusing on real, measurable problems like this clarifies our thinking and helps us avoid getting side-tracked on theory and guesswork.
Why TCSpinLock called sleep()
With that background in mind, let’s return to TCSpinLock and its call to sleep(). The original TCSpinLock was benchmarked against ten threads running on four CPU cores. In this environment, one thread could monopolize a CPU core, while the thread holding the lock didn’t run at all. Sleeping a little bit was a simple way to guarantee that the thread holding the lock got the CPU.
WebKit used this version of TCSpinLock for seven years without any problems. Most calls to malloc in WebKit are single-threaded. The lock didn’t show up on the profile. We spent our time on more important things.
What the benchmark said
Things changed when we added multi-core parallelism to our garbage collector. We ran a stress-test benchmark designed to put maximum pressure on the collector. The profile showed lots of collector threads waiting for standard library locks. Even though the locks were contended for very short periods of time, the standard library would unschedule waiting threads, or even idle their CPU cores, and the latency to wake back up was too high. So, we started using TCSpinLock in more places, to optimize for the case where the lock was held for a short time.
Then, we profiled again, and saw the following pathology: All threads would try to acquire a TCSpinLock simultaneously, one would win, and the others would sleep for 2ms. On a 24-core system, if you sleep 23 of your cores for 2ms, you spend the next 2ms running 24X slower!
The solution was simple: don’t do that. WebKit takes care not to spawn more threads than cores, so monopolizing a core won’t stop overall forward progress. Also, as a fail-safe, we still yield to the OS scheduler to avoid monopolizing a core – we just don’t enforce a minimum sleep time. How do we know this works? Because the benchmark got faster, and no other benchmarks got slower.
What will get faster?
In the short run: garbage collection. On multi-core systems, the collector can support much heavier memory load without increased pause times. In the long run: lots of things. The collector is our most parallel system, but we’re moving toward more parallel algorithms over time. Having a solid spinlock implementation is an important building block.
Is this the “right” solution?
TCSpinLock was trying to solve a problem we don’t have. What’s the right solution for software that runs ten threads on four cores? For what it’s worth, TCSpinLock has moved away from simple sleep(), too, and implemented a complex system of queueing and random staggering. We’ll cross that bridge when it shows up on a benchmark profile.
Why not just use standard library locks?
We do! But not for everything. Our spinlocks take advantage of a critical optimization: Knowing they’ll be held for a short time. Since spinlocks never sleep, they recover from contention as quickly as possible. As the profile showed, standard locks are much more conservative, and they may even idle a CPU core.
Don’t just take my word for it!
WebKit has great performance because we constantly measure and optimize, and we don’t take anything for granted. That ethos applies to all of our code base, including this patch. You can use a nightly build to verify these results for yourself. And if you think there’s a better way to do things, submit a bug report with a benchmark showing why — or, even better, submit a patch.