A Sound Approach to Responsiveness Scores
The latency component of JetStream is mostly made up of tests that measure the latency incurred when executing code for the first time. But it also includes two tests, splay-latency and mandreel-latency, whose scores reflect how close a browser’s worst-case performance is to its average-case performance. A relatively good score on splay-latency indicates that a browser’s garbage collector is less likely to interfere with a web app’s responsiveness. Mandreel-latency is similar, but due to its larger code base and lower allocation rate, it primarily tests whether a browser’s just-in-time compiler interferes with responsiveness. Both of these tests originate from the Octane version 2 benchmark suite.
Splay-latency and mandreel-latency are intended to reward responsive browsers. A responsive browser will always execute the task quickly, rather than usually executing it quickly but sometimes having a hiccup with very bad performance. But these tests use the wrong metric for computing a responsiveness score: they first compute the root mean square (RMS) of thousands of samples, and then report a score that is the reciprocal of the RMS. It turns out that this number doesn’t adequately reward browsers that have great responsiveness, and in some cases, it gives a high score to browsers that are less responsive.
For example, consider what would happen if Browser A always ran a task in 20 milliseconds, while Browser B ran the same task in either 10 milliseconds or 5 milliseconds at random with equal probability. Clearly, Browser B would appear to be more responsive: it always completes the task at least twice as fast as Browser A. But using RMS as the latency metric means that Browser A gets a higher score: since its performance is always 20 milliseconds, its samples show no variation, so it will have an RMS of zero, which leads to a latency score of 1/0, i.e. infinity. Browser B will have an RMS around 3.5, leading to a score of 1/3.5, i.e. less than infinity. Setting aside the absurdity of a browser getting an infinite score (that’s also something we want to prevent), it’s clear that Browser A has a higher score on these tests despite always being less responsive. We’d like to be able to use these tests to tune the responsiveness of WebKit: we want to accept changes that improve the score and reject those that degrade it. But we can’t do this if the score rewards bad behavior.
The reason is that the RMS will increase (and the 1/RMS score will decrease) whenever there is an outlier in either direction. Both good outliers (the browser sometimes finishes a task much faster than normal) and bad outliers (the browser sometimes takes a long time) increase the RMS by the same amount. Instead of RMS, we want a one-sided metric, which punishes bad outliers while ignoring the good ones — we are fine with browsers sometimes being much faster than the average so long as the worst case is still great.
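The symmetry can be seen in a small sketch. It assumes that "RMS" here means the root mean square of each sample's deviation from the mean, which is the reading under which a perfectly steady browser gets zero. The helper name `rms_deviation` and the timings are invented for illustration and are not taken from the Octane or JetStream sources.

```python
import math


def rms_deviation(samples):
    """Root mean square of each sample's deviation from the mean (assumed reading of "RMS")."""
    mean = sum(samples) / len(samples)
    return math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))


# Hypothetical timings in milliseconds: 100 runs of the same task.
steady = [20.0] * 100
one_fast_run = [20.0] * 99 + [10.0]  # a "good" outlier
one_slow_run = [20.0] * 99 + [30.0]  # a "bad" outlier

for name, samples in [
    ("steady", steady),
    ("one fast run", one_fast_run),
    ("one slow run", one_slow_run),
]:
    print(f"{name}: rms={rms_deviation(samples):.3f} ms, slowest={max(samples):.1f} ms")
```

Under this reading, the fast outlier and the slow outlier raise the RMS by the same amount, while only the slow one changes the slowest sample. That asymmetry is what a one-sided metric is meant to capture.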
The simplest solution would be to compute a score based on the slowest execution out of the thousands of executions that these tests do. It’s tempting to do this, but it carries a cost: the worst case can be a noisy number. We want to reduce the measurement noise, since we want to be able to confidently measure small differences in performance between different versions of our code. An easy way to get most of the benefit of a worst-case measurement while reducing noise is to take the 0.5% worst of all samples and report their average.
Using the average of the worst samples never punishes browsers for sometimes running faster than normal. It zeroes in on exactly what we mean by responsiveness: how fast the browser will run in the worst case. Consider how this affects splay-latency. The samples above the 99.5th percentile (roughly 10 samples in a typical run of splay) are exactly those where WebKit’s garbage collector has to do a full scan of the heap. Browsers that can split the heap scan into smaller chunks of work that never incur one long pause will score better on this test. The splay-latency and mandreel-latency scores are also easy to interpret: if one browser scores 2× higher than another, it means that this browser’s occasional performance hiccups will be 2× less severe.
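As a rough illustration of the "average of the worst 0.5%" idea, here is a minimal sketch. The function names, the rounding rule for how many samples count as the worst 0.5%, and the use of a plain reciprocal as the score are all assumptions made for this example; the actual JetStream harness may differ in each of these.

```python
def worst_case_average(samples, worst_fraction=0.005):
    """Average of the slowest `worst_fraction` of samples (0.5% by default)."""
    # Rounding to the nearest whole sample, and keeping at least one, is an assumption.
    count = max(1, round(len(samples) * worst_fraction))
    slowest = sorted(samples, reverse=True)[:count]
    return sum(slowest) / count


def latency_score(samples):
    """Reciprocal of the worst-case average, so shorter pauses score higher (assumed scaling)."""
    return 1.0 / worst_case_average(samples)


# Hypothetical runs of 2,000 samples (milliseconds), 10 of which hit a long pause.
typical = [1.0] * 1990 + [40.0] * 10
shorter_pauses = [1.0] * 1990 + [20.0] * 10
some_faster_runs = [0.5] * 500 + [1.0] * 1490 + [40.0] * 10

print(worst_case_average(typical))
print(worst_case_average(some_faster_runs))
print(latency_score(shorter_pauses) / latency_score(typical))
```

In this made-up data, making a quarter of the ordinary samples faster leaves the worst-case average unchanged, and halving the length of the ten long pauses doubles the score.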
1. CDjs uses red-black trees rather than hashtables for its various mappings. This makes CDjs slightly more faithful to the concept of real-time, since hashtables may have random worst-case pathologies due to collisions. Also, while we were tempted to use the new Map API in ES6, we weren’t able to do so in CDjs because it needs to use objects’ values as keys rather than object identity. The red-black tree implementation is a port of WebKit’s existing red-black tree.
2. CDjs doesn’t attempt to reuse objects quite as aggressively as CDj did. This makes CDjs more faithful to the experimental question that CDx was trying to ask: how much slower do you run when you write a real-time workload in a high-level language? The answer is more meaningful if the high-level program fully utilizes the features of the high-level language, like garbage collection.
We run CDjs as a benchmark by simulating 1,000 aircraft that occasionally collide. We run the benchmark for 200 frames. This makes the benchmark run for around one second on a modern browser and typical notebook computer. The score is the inverse of the average of the 10 worst execution times from those 200 frames. The addition of this benchmark doesn’t change the overall weighting of JetStream between latency and throughput, though it does shift the definition of latency in the direction of worst-case of many samples rather than the cost of cold start.
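The scoring rule described above can be sketched in a few lines. The function name `cdjs_style_score`, the frame times, and the millisecond unit are hypothetical, and any constant scaling factor the real harness applies is left out.

```python
import heapq


def cdjs_style_score(frame_times, worst_count=10):
    """Inverse of the mean of the `worst_count` slowest frame times."""
    worst = heapq.nlargest(worst_count, frame_times)
    return 1.0 / (sum(worst) / len(worst))


# Hypothetical frame times in milliseconds for a 200-frame run:
# 190 ordinary frames and 10 slow ones.
frame_times = [4.0] * 190 + [8.0] * 10
print(cdjs_style_score(frame_times))
```

Only the ten slowest frames influence the result, so speeding up the other 190 frames would not change the score in this sketch.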
The addition of CDjs to JetStream was tracked by WebKit bug 146156.
We are happy to introduce this significant update to the JetStream benchmark suite. This update makes JetStream’s latency metrics more accurate than they were before. The new suite is now available to run at browserbench.org/JetStream/. As always, file bugs at bugs.webkit.org if you find any issues in this benchmark.