Assembling WebAssembly

We’re pleased to announce that WebKit has a full WebAssembly implementation.

Dynamic Duo

WebAssembly is a no-nonsense sidekick to JavaScript. It isn’t meant to be hand-written; rather, it’s a low-level binary format designed to be a suitable compilation target for existing languages such as C++. The WebAssembly code that the browser sees will already have undergone high-level, language-specific optimizations. This is great because it means implementations don’t have to know about how C++ or other languages are optimized. Running expensive language-specific optimization on the developers’ machines allows WebKit to focus on target-specific optimizations. It also means that we can focus WebAssembly compiler optimizations on fast code delivery and predictable performance.

WebAssembly cannot do everything that JavaScript does. For example, WebAssembly cannot access the DOM except by calling into JavaScript. WebAssembly is meant to be used in conjunction with JavaScript. The JavaScript / WebAssembly dynamic duo work in partnership, allowing JavaScript to stay focused on taking over the world, while WebAssembly accelerates computation-intensive tasks.


JavaScript and its sidekick, cosplaying as 1966 Batman

The web wouldn’t be itself without security and portability. WebAssembly delivers on both fronts. It executes C++ code at near native speed with the same joker-less security guarantees that JavaScript offers. Our implementation supports WebAssembly on x86-64 as well as ARM64 platforms. Portable performance is really where WebAssembly shines: the WebAssembly virtual Instruction Set Architecture was designed to be an ideal compilation target for today’s modern processors.

Sidekick

WebAssembly fits in with the rest of the web platform, as a good sidekick would. It re-uses existing web APIs and exposes a new JavaScript API. WebAssembly has four core concepts: WebAssembly.Module, WebAssembly.Instance, WebAssembly.Memory, and WebAssembly.Table. A module represents code and is similar to a program on disk, whereas an instance is an execution of that program. There can be multiple concurrently executing instances of the same module. A memory represents an instance’s heap. It is contiguous, bounds-checked, and can be exposed to JavaScript as an ArrayBuffer. All WebAssembly memory operations operate on the instance’s memory. Finally, a table holds handles to WebAssembly functions, allowing indirect calls within an instance to target different functions with the same signature (in C++-speak: virtual functions and function pointers). Interestingly, instances can share the same memory, and tables can call directly across instances, which enables dynamic linking.
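To make the four concepts concrete, here is a minimal sketch using the standard JavaScript API. The byte array is a hand-assembled module written for this illustration, not output from any toolchain; the export name add and all variable names are assumptions.

```javascript
// A hand-assembled module exporting one function: add(i32, i32) -> i32.
const addModuleBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // "\0asm" magic, version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00, // function section: one function of type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add" = function 0
  // code: get_local 0, get_local 1, i32.add, end
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,
]);

// Module: the compiled code, like a program on disk.
const wasmModule = new WebAssembly.Module(addModuleBytes);

// Instance: one execution of that program. A module can back several.
const firstInstance = new WebAssembly.Instance(wasmModule);
const secondInstance = new WebAssembly.Instance(wasmModule);
const sum = firstInstance.exports.add(2, 3);

// Memory: a bounds-checked heap, visible to JavaScript as an ArrayBuffer.
// Sizes are given in 64 KiB WebAssembly pages.
const memory = new WebAssembly.Memory({ initial: 1 });
const heapBytes = new Uint8Array(memory.buffer);

// Table: holds handles to WebAssembly functions for indirect calls.
const table = new WebAssembly.Table({ initial: 1, element: "anyfunc" });
table.set(0, firstInstance.exports.add);
const viaTable = table.get(0)(40, 2);

console.log(sum, heapBytes.length, viaTable);
```

The sketch compiles synchronously to stay short. In a page, the promise-based WebAssembly.compile and WebAssembly.instantiate are usually the better choice for anything larger than a toy module.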

Because WebAssembly exposes itself as a regular JavaScript object, we were able to reuse some machinery that already existed within WebKit. An interesting example is our reuse of our ECMAScript Module implementation to implement WebAssembly.Instance’s API. For now it’s an implementation detail, invisible to web developers, but integration with ECMAScript Modules is being discussed. For developers who use modules, interaction between JavaScript and WebAssembly would then be totally seamless. Behind the mask of modules, our heroes’ secret identities wouldn’t be revealed.

To allow sharing of modules between Web Workers and to prepare ourselves for future features like threads, we’ve made our internal representation of WebAssembly code thread-safe. This lets developers postMessage a compiled WebAssembly.Module across workers without requiring re-compilation, copying, or any other redundant work. Our implementation of postMessage for modules is simpler than a riddle: sharing a module between workers involves passing a reference to our internal module representation to the other worker. That worker will run the same machine code as the agent that originally produced the module.
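As a rough sketch of what this looks like from the developer’s side, the snippet below compiles a module once and hands it to a worker with postMessage. It uses Node.js worker_threads as a stand-in for Web Workers so that it runs outside a browser; that substitution, the hand-assembled add module, and the helper name addInWorker are all assumptions made for illustration.

```javascript
// Hand-assembled module exporting add(i32, i32) -> i32 (illustrative only).
const sharedModuleBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f,
  0x03, 0x02, 0x01, 0x00,
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00,
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,
]);

// Compile once, on the thread that owns the bytes.
const sharedModule = new WebAssembly.Module(sharedModuleBytes);

// Hypothetical helper: posts the compiled module to a worker, which
// instantiates it (no recompilation of the bytes) and calls add(a, b).
async function addInWorker(wasmModule, a, b) {
  const { Worker } = await import("node:worker_threads");
  const workerSource = `
    const { parentPort } = require("node:worker_threads");
    parentPort.once("message", ({ wasmModule, a, b }) => {
      const instance = new WebAssembly.Instance(wasmModule);
      parentPort.postMessage(instance.exports.add(a, b));
    });
  `;
  const worker = new Worker(workerSource, { eval: true });
  try {
    return await new Promise((resolve, reject) => {
      worker.once("message", resolve);
      worker.once("error", reject);
      worker.postMessage({ wasmModule, a, b });
    });
  } finally {
    await worker.terminate();
  }
}

addInWorker(sharedModule, 2, 3).then((result) => console.log(result));
```

In a page the shape is the same: the sender calls worker.postMessage(module) and the receiving worker instantiates the module it gets from the message event.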

Utility Belt

WebAssembly directly exposes 32- and 64-bit integers as well as 32- and 64-bit floating point numbers. Its instruction set is equally simple:

i32.add i64.add f32.add f64.add i32.wrap/i64 i32.load8_s i32.store8
i32.sub i64.sub f32.sub f64.sub i32.trunc_s/f32 i32.load8_u i32.store16
i32.mul i64.mul f32.mul f64.mul i32.trunc_s/f64 i32.load16_s i32.store
i32.div_s i64.div_s f32.div f64.div i32.trunc_u/f32 i32.load16_u i64.store8
i32.div_u i64.div_u f32.abs f64.abs i32.trunc_u/f64 i32.load i64.store16
i32.rem_s i64.rem_s f32.neg f64.neg i32.reinterpret/f32 i64.load8_s i64.store32
i32.rem_u i64.rem_u f32.copysign f64.copysign i64.extend_s/i32 i64.load8_u i64.store
i32.and i64.and f32.ceil f64.ceil i64.extend_u/i32 i64.load16_s f32.store
i32.or i64.or f32.floor f64.floor i64.trunc_s/f32 i64.load16_u f64.store
i32.xor i64.xor f32.trunc f64.trunc i64.trunc_s/f64 i64.load32_s
i32.shl i64.shl f32.nearest f64.nearest i64.trunc_u/f32 i64.load32_u
i32.shr_u i64.shr_u i64.trunc_u/f64 i64.load call
i32.shr_s i64.shr_s f32.sqrt f64.sqrt i64.reinterpret/f64 f32.load call_indirect
i32.rotl i64.rotl f32.min f64.min f64.load
i32.rotr i64.rotr f32.max f64.max
i32.clz i64.clz nop grow_memory
i32.ctz i64.ctz block current_memory
i32.popcnt i64.popcnt f32.demote/f64 loop
i32.eqz i64.eqz f32.convert_s/i32 if get_local
i32.eq i64.eq f32.convert_s/i64 else set_local
i32.ne i64.ne f32.convert_u/i32 br tee_local
i32.lt_s i64.lt_s f32.convert_u/i64 br_if
i32.le_s i64.le_s f32.reinterpret/i32 br_table get_global
i32.lt_u i64.lt_u f32.eq f64.eq f64.promote/f32 return set_global
i32.le_u i64.le_u f32.ne f64.ne f64.convert_s/i32 end
i32.gt_s i64.gt_s f32.lt f64.lt f64.convert_s/i64 i32.const
i32.ge_s i64.ge_s f32.le f64.le f64.convert_u/i32 drop i64.const
i32.gt_u i64.gt_u f32.gt f64.gt f64.convert_u/i64 select f32.const
i32.ge_u i64.ge_u f32.ge f64.ge f64.reinterpret/i64 unreachable f64.const

The instructions are low-level by design, and it is this low-level quality that gives WebAssembly its power. WebAssembly was born as a compilation target, molded by compiler engineers.
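One consequence of this low-level design is that integers carry no signedness of their own; the instruction chooses the interpretation, which is what the _s and _u suffixes in the table mean. The sketch below runs i32.div_s and i32.div_u on the same bit pattern. The module bytes are hand-assembled for this example, and the export names div_s and div_u are assumptions chosen to mirror the instruction names.

```javascript
// Hand-assembled module with two exports of type (i32, i32) -> i32:
//   div_s: get_local 0, get_local 1, i32.div_s
//   div_u: get_local 0, get_local 1, i32.div_u
const divModuleBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic, version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type 0
  0x03, 0x03, 0x02, 0x00, 0x00, // two functions, both of type 0
  0x07, 0x11, 0x02, // export section, two entries
  0x05, 0x64, 0x69, 0x76, 0x5f, 0x73, 0x00, 0x00, // "div_s" = function 0
  0x05, 0x64, 0x69, 0x76, 0x5f, 0x75, 0x00, 0x01, // "div_u" = function 1
  0x0a, 0x11, 0x02, // code section, two bodies
  0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6d, 0x0b, // ... i32.div_s, end
  0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6e, 0x0b, // ... i32.div_u, end
]);

const divInstance = new WebAssembly.Instance(
  new WebAssembly.Module(divModuleBytes)
);
const { div_s, div_u } = divInstance.exports;

// -7 is the bit pattern 0xFFFFFFF9. Signed division truncates toward zero;
// unsigned division treats the same bits as 4294967289.
const signedQuotient = div_s(-7, 2);
const unsignedQuotient = div_u(-7, 2);

// Integer division by zero is a trap, surfaced as WebAssembly.RuntimeError.
let divideByZeroTrapped = false;
try {
  div_s(1, 0);
} catch (error) {
  divideByZeroTrapped = error instanceof WebAssembly.RuntimeError;
}

console.log(signedQuotient, unsignedQuotient, divideByZeroTrapped);
```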

OMG! BBQ!

WebKit’s WebAssembly implementation, like our JavaScript implementation, uses a tiering system to balance startup costs with throughput. Currently, there are two tiers to the engine: the Build Bytecode Quickly (BBQ) tier and the Optimized Machine-code Generator (OMG) tier. Both rely on the B3 JIT as their low-level optimizer.

WebAssembly modules can easily contain tons of code, some of which runs at most once or only rarely. This is why we opted to use two tiers: one that generates decent code quickly, and one that generates optimized code only when the engine thinks the code is hot enough to warrant it. BBQ compiles code about 4× as fast as OMG, but produces code that executes roughly 2× as slow as OMG’s. We compile functions with OMG on a background thread. When OMG compilation finishes, we pause the executing WebAssembly threads and hot-patch the OMG compilation into the module.
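A back-of-the-envelope model shows why this trade-off favors starting in BBQ. The sketch below takes the 4× and 2× ratios quoted above and asks how many calls it takes before compiling with OMG alone would have been cheaper than compiling with BBQ alone. It is a deliberate simplification: it ignores background compilation and the fact that a tiered-up function pays for both compiles, and the function names and sample timings are assumptions.

```javascript
const BBQ_COMPILE_SPEEDUP = 4; // BBQ compiles about 4x as fast as OMG
const BBQ_EXEC_SLOWDOWN = 2; // BBQ code runs roughly 2x as slow as OMG code

// Total cost of compiling one function in a single tier and calling it
// `calls` times. Costs are expressed relative to OMG's own costs.
function totalTime(tier, omgCompileMs, omgCallMs, calls) {
  if (tier === "BBQ") {
    return omgCompileMs / BBQ_COMPILE_SPEEDUP + omgCallMs * BBQ_EXEC_SLOWDOWN * calls;
  }
  return omgCompileMs + omgCallMs * calls;
}

// Number of calls at which both tiers cost the same:
//   C/4 + 2tn = C + tn  =>  n = (C - C/4) / (2t - t)
function breakEvenCalls(omgCompileMs, omgCallMs) {
  const compileTimeSaved = omgCompileMs - omgCompileMs / BBQ_COMPILE_SPEEDUP;
  const extraCostPerCall = omgCallMs * BBQ_EXEC_SLOWDOWN - omgCallMs;
  return compileTimeSaved / extraCostPerCall;
}

// Made-up timings: OMG needs 8 ms to compile, each call takes 0.5 ms.
const breakEven = breakEvenCalls(8, 0.5);
console.log(breakEven);
console.log(totalTime("BBQ", 8, 0.5, breakEven), totalTime("OMG", 8, 0.5, breakEven));
```

Under these assumptions, any function called fewer times than the break-even count is better off never leaving BBQ, which is the case for code that runs once at startup.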

WebAssembly being a low-level format—compared to JavaScript being a dynamic language—means that things aren’t as fickle when compiling WebAssembly. For example, WebAssembly statically tells us the types of all values within a function and gives us the signatures of all functions ahead of time.

BBQ 🔥

In order to produce executable code as soon as possible, the BBQ tier omits many of the optimizations possible in the B3 compiler. The BBQ tier also uses a linear scan combined register / stack allocation algorithm, which allocates registers about 11× faster than the graph coloring algorithm that B3 usually uses. Avoiding expensive optimization allows WebKit to produce BBQ fast, so that BBQ may be consumed as soon as possible.
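For readers unfamiliar with linear scan, here is a minimal sketch of the textbook algorithm: walk live intervals in order of their start point, hand out free registers, and spill whichever live interval ends last when none are free. This is not WebKit’s allocator; the interval format, the register and stack slot names, and the linearScan function are assumptions made for illustration.

```javascript
// Textbook linear scan over live intervals of the form { name, start, end }.
// Returns a Map from interval name to a register ("r0") or stack slot ("stack0").
function linearScan(intervals, registerCount) {
  const sorted = [...intervals].sort((a, b) => a.start - b.start);
  const freeRegisters = Array.from({ length: registerCount }, (_, i) => `r${i}`);
  const location = new Map();
  let active = []; // intervals currently holding a register
  let nextStackSlot = 0;

  for (const interval of sorted) {
    // Expire intervals that ended before this one starts; reclaim registers.
    active = active.filter((old) => {
      if (old.end >= interval.start) return true;
      freeRegisters.push(location.get(old.name));
      return false;
    });

    if (freeRegisters.length > 0) {
      location.set(interval.name, freeRegisters.shift());
      active.push(interval);
      continue;
    }

    // No register is free: spill whichever interval lives the longest.
    active.sort((a, b) => a.end - b.end);
    const victim = active[active.length - 1];
    if (victim && victim.end > interval.end) {
      location.set(interval.name, location.get(victim.name));
      location.set(victim.name, `stack${nextStackSlot++}`);
      active[active.length - 1] = interval;
    } else {
      location.set(interval.name, `stack${nextStackSlot++}`);
    }
  }
  return location;
}

const assignment = linearScan(
  [
    { name: "a", start: 0, end: 10 },
    { name: "b", start: 1, end: 3 },
    { name: "c", start: 2, end: 8 },
    { name: "d", start: 4, end: 6 },
  ],
  2
);
console.log([...assignment.entries()]);
```

The speed comes from making a single pass over the sorted intervals, where graph coloring has to build and repeatedly simplify an interference graph.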

OMG 😲

When a function has been executed enough times, our WebAssembly runtime decides to optimize that function. Since WebAssembly does not require any type speculations, we only use tiering to conserve compile time. BBQ code contains only the profiling needed to detect when code has executed many times.

We share modules, along with all of their runtime state (like tiering-up from BBQ to OMG) between workers. Since we hot-patch the BBQ code, which might be executing on any number of threads, we need to be sure that each call-site can be concurrently updated to point to the OMG code. In order to avoid checking for new code on each call to a function, like JavaScript does, we track each call-site to every function, both in assembly and indirectly. Whenever the OMG version of a function finishes compiling, we patch each of those call-sites to point to the new code.
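The bookkeeping can be pictured with a toy model. In the sketch below, a call-site is an object with a mutable target, the BBQ entry point carries the only profiling counter, and tiering up rewrites every recorded call-site so later calls jump straight to the optimized code without any per-call check. Everything here is hypothetical (makeTieredFunction, createCallSite, the threshold); real call-sites are patched machine code, and the real OMG compile happens on a background thread rather than inline.

```javascript
// Toy model of call-site tracking and patching. All names are hypothetical.
function makeTieredFunction(bbqCode, omgCode, threshold) {
  const callSites = []; // every call-site that targets this function
  let calls = 0;
  let tier = "BBQ";

  // The BBQ entry point contains the only profiling: a call counter.
  function bbqEntry(...args) {
    calls += 1;
    if (tier === "BBQ" && calls >= threshold) {
      // Stand-in for "OMG compilation finished": patch every call-site.
      tier = "OMG";
      for (const site of callSites) site.target = omgCode;
    }
    return bbqCode(...args);
  }

  return {
    createCallSite() {
      const site = { target: bbqEntry };
      callSites.push(site);
      return site;
    },
    get tier() {
      return tier;
    },
  };
}

const slowSquare = (x) => { let r = 0; for (let i = 0; i < x; i++) r += x; return r; };
const fastSquare = (x) => x * x;

const square = makeTieredFunction(slowSquare, fastSquare, 3);
const siteA = square.createCallSite();
const siteB = square.createCallSite();

const results = [];
for (let i = 0; i < 3; i++) results.push(siteA.target(4)); // third call tiers up
results.push(siteB.target(4)); // never ran hot itself, but was patched anyway

console.log(results, square.tier, siteB.target === fastSquare);
```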

Mem-SIGNAL 🦇

One of the most important optimizations in WebAssembly is reducing the overhead on memory accesses while preserving security guarantees. WebAssembly specifies that all memory accesses are performed on a single 32-bit linear memory. We take advantage of this by reserving slightly more than 4GiB of virtual address space for each memory. This virtual reservation doesn’t occupy physical memory unless accessed. We use the hardware’s page protection mechanism to mark all but the lower pages as non-readable and non-writable. All loads and stores from this 32-bit address space can be performed normally, without explicit bounds checking instructions. If a load or store would be out of bounds, it will trigger a hardware fault that will ultimately result in a POSIX SIGSEGV or a Mach EXC_BAD_ACCESS exception. Our custom signal / Mach exception handlers then check whether the fault originated from a WebAssembly memory operation. If so, we set the instruction pointer to a special code stub that performs a WebAssembly trap and materializes a WebAssembly.RuntimeError in JavaScript. This optimization means that a WebAssembly memory operation typically results in a single load or store instruction. Overall, we measured a 15-20% speedup from this optimization on various WebAssembly benchmarks.
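From JavaScript, the only visible part of this machinery is the trap itself. The sketch below instantiates a hand-assembled module with a one-page memory and an exported function that performs an i32.load at the given address, then reads just inside and just past the end of the memory. The module bytes and the export name load are assumptions for illustration; whether the engine catches the out-of-bounds access with an explicit check or with a page fault is not observable from script.

```javascript
// Hand-assembled module: one 64 KiB page of memory and
//   load(address: i32) -> i32 = get_local 0, i32.load
const loadModuleBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic, version 1
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f, // type: (i32) -> i32
  0x03, 0x02, 0x01, 0x00, // one function of type 0
  0x05, 0x03, 0x01, 0x00, 0x01, // memory: initial size 1 page, no maximum
  0x07, 0x08, 0x01, 0x04, 0x6c, 0x6f, 0x61, 0x64, 0x00, 0x00, // export "load"
  // code: get_local 0, i32.load (align=2, offset=0), end
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x28, 0x02, 0x00, 0x0b,
]);

const loadInstance = new WebAssembly.Instance(
  new WebAssembly.Module(loadModuleBytes)
);
const PAGE_SIZE = 65536;

// The last in-bounds 4-byte load starts at PAGE_SIZE - 4. Memory is zeroed.
const lastWord = loadInstance.exports.load(PAGE_SIZE - 4);

// One byte further and the load crosses the end of memory, so it traps.
let outOfBoundsError = null;
try {
  loadInstance.exports.load(PAGE_SIZE - 3);
} catch (error) {
  outOfBoundsError = error;
}

console.log(lastWord, outOfBoundsError instanceof WebAssembly.RuntimeError);
```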

WebAssembly memory signal

💥 KAPOW! 💥

Get in touch with us by filing a bug, or contacting JF, Keith, and Saam on Twitter.