Optimizing WebKit & Safari for Speedometer 3.0

The introduction of Speedometer 3.0 is a major step forward in making the web faster for all, and allowing Web developers to make websites and web apps that were not previously possible. In this article, we explore ways the WebKit team made performance optimizations in WebKit and Safari based on the Speedometer 3.0 benchmark.

In order to make these improvements, we made an extensive use of our performance testing infrastructure. It’s integrated with our continuous integration, and provides the capability to schedule A/B tests. This allows engineers to quickly test out performance optimizations and catch new performance regressions.

Improving Tools

Proper tooling support is the key to identifying and addressing performance bottlenecks. We defined new internal JSON format for JavaScriptCore sampling profiler output to dump and process them offline. It includes a script which processes and generates analysis of hot functions and hot byte codes for JavaScriptCore. We also added FlameGraph generation tool for the dumped sampling profiler output which visualizes performance bottlenecks. In addition, we added support for JITDump generation on Darwin platforms to dump JIT related information during execution. And we improved generated JITDump information for easy use as well. These tooling improvements allowed us to quickly identify bottlenecks across Speedometer 3.0.

Improving JavaScriptCore

Revising Megamorphic Inline Cache (IC)

Megamorphic IC offers faster property access when one property access site observes many different object types and/or property names. We observed that some frameworks such as React contain a megamorphic property access. This led us to continuously improve JavaScriptCore’s megamorphic property access optimizations: expanding put megamorphic IC, adding megamorphic IC for the in operation, and adding generic improvements for megamorphic IC.

Revising Call IC

Call IC offers faster function calls by caching call targets inline. We redesigned Call IC and we integrated two different architectures into different tiers of Just-In-Time (JIT) compilers. Lower level tiers use Call IC without any JIT code generation and the highest level tier uses JIT code generatiton with the fastest Call IC. There is a tradeoff between code generation time and code efficiency, and JavaScriptCore performs a balancing act between them to achieve the best performance across different tiers.

Optimizing JSON

Speedometer 3.0 also presented new optimization opportunities to our JSON implementations as they contain more non-ASCII characters than before. We made our fast JSON stringifier work for unicode characters. We also analyzed profile data carefully and made JSON.parse faster than ever.

Adjusting Inlining Heuristics

There are many tradeoffs when inlining functions in JavaScript. For example, inline functions can more aggressively increase the total bytecode size and may cause memory bandwidth to become a new bottleneck. The amount of instruction cache available in CPU can also influence how effective a given inlining strategy is. And the calculus of these tradeoffs change over time as we make more improvements to JavaScriptCore such as adding new bytecode instruction and changes to DFG’s numerous optimization phases. We took the release of the new Speedometer 3.0 benchmark as an opportunity to adjust inlining heuristics based on data collected in modern Apple silicon Macs with the latest JavaScriptCore.

Make JIT Code Destruction Lazy

Due to complicated conditions, JavaScriptCore eagerly destroyed CodeBlock and JIT code when GC detects they are dead. Since these destructions are costly, they should be delayed and processed while the browser is idle. We made changes so that they are now destroyed lazily, during idle time in most cases.

Opportunistic Sweeping and Garbage Collection

In addition, we noticed that a significant amount of time goes into performing garbage collection and incremental sweeping across all subtests in both Speedometer 2.1 and 3.0. In particular, if a subtest allocated a large number of JavaScript objects on the heap, we would often spend a significant amount of time in subsequent subtests collecting these objects. This had several effects:

  1. Increasing synchronous time intervals on many subtests due to on-demand sweeping and garbage collection when hitting heap size limits.
  2. Increasing asynchronous time intervals on many subtests due to asynchronous garbage collection or timer-based incremental sweeping triggering immediately after the synchronous timing interval.
  3. Increasing overall variance depending on whether timer-based incremental sweeping and garbage collection would fall in the synchronous or asynchronous timing windows of any given subtest.

At a high level, we realized that some of this work could be performed opportunistically in between rendering updates — that is, during idle time — instead of triggering in the middle of subtests. To achieve this, we introduced a new mechanism in WebCore to provide hints to JavaScriptCore to opportunistically perform scheduled work after the previous rendering update has completed until a given deadline (determined by the estimated remaining time until the next rendering update). The opportunistic task scheduler also accounts for imminently scheduled zero delay timers or pending requestAnimationFrame callbacks: if it observes either, it’s less likely to schedule opportunistic work in order to avoid interference with imminent script execution. We currently perform a couple types of opportunistically scheduled tasks:

  • Incremental Sweeping: Prior to the opportunistic task scheduler, incremental sweeping in JavaScriptCore was automatically triggered by a periodically scheduled 100 ms timer. This had the effect of occasionally triggering incremental sweeping during asynchronous timing intervals, but also wasn’t aggressive enough to prevent on-demand sweeping in the middle of script execution. Now that JavaScriptCore is knowledgable about when to opportunistically schedule tasks, it can instead perform the majority of incremental sweeping in between rendering updates while there aren’t imminently scheduled timers. The process of sweeping is also granular to each marked block, which allows us to halt opportunistic sweeping early if we’re about to exceed the deadline for the next estimated rendering update.
  • Garbage Collection: By tracking the amount of time spent performing garbage collection in previous cycles, we’re able to roughly estimate the amount of time needed to perform the next garbage collection based on the number of bytes visited or allocated since the last cycle. If the remaining duration for performing opportunistically scheduled tasks is longer than this estimated garbage collection duration, we immediately perform either an Eden collection or full garbage collection. Furthermore, we integrated activity-based garbage collections into this new scheme to schedule them at appropriate timing.

Overall, this strategy yields a 6.5% total improvement in Speedometer 3.0*, decreasing the time spent in every subtest by a significant margin, and a 6.9% total improvement in Speedometer 2.1*, significantly decreasing the time spent in nearly all subtests.

* macOS 14.4, MacBook Air (M2, 2022)

Various Miscellaneous Optimizations for Real World Use Cases

We extensively reviewed all Speedometer 3.0 subtests and did many optimizations for realistic use cases. The examples include but are not limited to: faster Object.assign with empty objects, improving object spread performance, and so on.

Improving DOM code

Improving DOM code is Speedometer’s namesake, and that’s exactly what we did. For example, we now store the NodeType in the Node object itself instead of relying on a virtual function call. We also made DOMParser use a fast parser, improved its support of li elements, and made DOMParser not construct a redundant DocumentFragment. Together, these changes improved TodoMVC-JavaScript-ES5 by ~20%. We also eliminated O(n^2) behavior in the fast parser for about ~0.5% overall progression in Speedometer 3.0. We also made input elements construct their user-agent shadow tree lazily during construction and cloning, the latter of which is new in Speedometer 3.0 due to web components and Lit tests. We devirtualized many functions and inlined more functions to reduce the function call overheads. We carefully reviewed performance profile data and removed inefficiency in hot paths like repeated reparsing of the same URLs.

Improving Layout and Rendering

We landed a number of important optimizations in our layout and rendering code. First off, most type checks performed on RenderObject are now done using an inline enum class instead of virtual function calls, this alone is responsible for around ~0.7% of overall progression in Speedometer 3.0.

Improving Style Engine

We also optimized the way we compute the properties animated by Web Animations code. Previously, we were enumerating every animatable properties while resolving transition: all. We optimized this code to only enumerate affected properties. This was ~0.7% overall Speedometer 3.0 progression. Animating elements can now be resolved without fully recomputing their style unless necessary for correctness.

Speedometer 3.0 content, like many modern web sites, uses CSS custom properties extensively. We implemented significant optimizations to improve their performance. Most custom property references are now resolved via fast cache lookups, avoiding expensive style resolution time property parsing. Custom properties are now stored in a new hierarchical data structure that reduces memory usage as well.

One key component of WebKit styling performance is a cache (called “matched declarations cache”) that maps directly from a set of CSS declarations to the final element style, avoiding repeating expensive style building steps for identically styled elements. We significantly improved the hit rate of this cache.

We also improved styling performance of author shadow trees, allowing trees with identical styles to share style data more effectively.

Improving Inline Layout

We fixed a number of performance bottlenecks in inline layout engine as well. Eliminating complex text path in Editor-TipTap was a major ~7% overall improvement. To understand this optimization, WebKit has two different code paths for text layout: the simple text path, which uses low level font API to access raw font data, and the complex text path, which uses CoreText for complex shaping and ligatures. The simple text path is faster but it does not cover all the edge cases. The complex text path has full coverage but is slower than the simple text path.

Previously, we were taking the complex text path whenever a non-default value of font-feature or font-variant was used. This is because historically the simple text path wouldn’t support these operations. However, we noticed that the only feature of these still missing in the simple text path was font-variant-caps. By implementing font-variant-caps support for the simple text path, we allowed the simple text path to handle the benchmark content. This resulted in 4.5x improvement in Editor-TipTap subtest, and ~7% overall progression in Speedometer 3.0.

In addition to improving the handling of text content in WebKit, we also worked with CoreText team to avoid unnecessary work in laying out glyphs. This resulted in ~0.5% overall progression in Speedometer 3.0, and these performance gains will benefit not just WebKit but other frameworks and applications that use CoreText.

Improving SVG Layout

Another area we landed many optimizations for is SVG. Speedometer 3.0 contains a fair bit of SVG content in test cases such as React-Stockcharts-SVG. We were spending a lot of time computing the bounding box for repaint by creating GraphicsContext, applying all styles, and actually drawing strokes in CoreGraphics. Here, we adopted Blink’s optimization to approximate bounding box and made ~6% improvement in React-Stockcharts-SVG subtest. We also eliminated O(n^2) algorithm in SVG text layout code, which made some SVG content load a lot quicker.

Improving IOSurface Cache Hit Rate

Another optimization we did involve improving the cache hit rate of IOSurface. An IOSurface is a bitmap image buffer we use to paint web contents into. Since creating this object is rather expensive, we have a cache of IOSurface objects based on their dimensions. We observed that the cache hit rate was rather low (~30%) so we increased the cache size from 64MB to 256MB on macOS and improved the cache hit rate by 2.7x to ~80%, improving the overall Speedometer 3.0 score by ~0.7%. In practice, this means lower latency for canvas operations and other painting operations.

Reducing Wait Time for GPU Process

Previously, we required a synchronous IPC call from the Web Process to the GPU process to determine which of the existing buffers had been released by CoreAnimation and was suitable to use for the next frame. We optimized this by having the GPUP just select (or allocate) an appropriate buffer, and direct all incoming drawing commands to the right destination without requiring any response. We also changed the delivery of any newly allocated IOSurface handles to go via a background helper thread, rather than blocking the Web Process’s main thread.

Improving Compositing

Updates to compositing layers are now batched, and flushed during rendering updates, rather than computed during every layout. This significantly reduces the cost of script-incurred layout flushes.

Improving Safari

In addition to optimizations we made in WebKit, there were a handful of optimizations for Safari as well.

Optimizing AutoFill Code

One area we looked at was Safari’s AutoFill code. Safari uses JavaScript to implement its AutoFill logic, and this execution time was showing up in the Speedometer 3.0 profile. We made this code significantly faster by waiting for the contents of the page to settle before performing some work for AutoFill. This includes coalescing handling of newly focused fields until after the page had finished loading when possible, and moving lower-priority work out of the critical path of loading and presenting the page for long-loading pages. This was responsible for ~13% progression in TodoMVC-React-Complex-DOM and ~1% progression in numerous other tests, improving the overall Speedometer 3.0 score by ~0.9%.

Profile Guided Optimizations

In addition to making the above code changes, we also adjusted our profile-guided optimizations to take Speedometer 3.0 into account. This allowed us to improve the overall Speedometer 3.0 score by 1~1.6%. It’s worth noting that we observed an intricate interaction between making code changes and profile-guided optimizations. We sometimes don’t observe an immediate improvement in the overall Speedometer 3.0 score when we eliminate, or reduce the runtime cost of a particular code path until the daily update of profile-guided optimizations kicks. This is because the modified or newly added code has to benefit from profile-guided optimizations before it can show a measurable difference. In some cases, we even observed that a performance optimization initially results in a performance degradation until the profile-guided optimizations are updated.

Results

With all these optimizations and dozens more, we were able to improve the overall Speedometer 3.0 score by ~60% between Safari 17.0 and Safari 17.4. Even though individual progressions were often less than 1%, over time, they all stacked up together to make a big difference. Because some of these optimizations also benefited Speedometer 2.1, Safari 17.4 is also ~13% faster than Safari 17.0 on Speedometer 2.1. We’re thrilled to deliver these performance improvements to our users allowing web developers to build websites and web apps that are more responsive and snappier than ever.