Assembling WebAssembly

Jun 6, 2017

by JF Bastien, Keith Miller, Saam Barati

We’re pleased to announce that WebKit has a full WebAssembly implementation.

Dynamic Duo

WebAssembly is a no-nonsense sidekick to JavaScript. It isn’t meant to be hand-written; rather, it’s a low-level binary format designed to be a suitable compilation target for existing languages such as C++. The WebAssembly code that the browser sees will already have undergone high-level, language-specific optimizations. This is great because it means implementations don’t have to know about how C++ or other languages are optimized. Running expensive language-specific optimization on the developers’ machines allows WebKit to focus on target-specific optimizations. It also means that we can focus WebAssembly compiler optimizations on fast code delivery and predictable performance.

WebAssembly cannot do everything that JavaScript does. For example, WebAssembly cannot access the DOM except by calling into JavaScript. WebAssembly is meant to be used in conjunction with JavaScript. The JavaScript / WebAssembly dynamic duo work in partnership, allowing JavaScript to stay focused on taking over the world, while WebAssembly accelerates computation-intensive tasks.

The web wouldn’t be itself without security and portability. WebAssembly delivers on both fronts. It executes C++ code at near native speed with the same joker-less security guarantees that JavaScript offers. Our implementation supports WebAssembly on x86-64 as well as ARM64 platforms. Portable performance is really where WebAssembly shines: the WebAssembly virtual Instruction Set Architecture was designed to be an ideal compilation target for modern processors.

Sidekick

WebAssembly fits in with the rest of the web platform, as a good sidekick would. It re-uses existing web APIs and exposes a new JavaScript API. WebAssembly has four core concepts: WebAssembly.Module, WebAssembly.Instance, WebAssembly.Memory, and WebAssembly.Table. A module represents code and is similar to a program on disk, whereas an instance is an execution of that program. There can be multiple concurrently executing instances of the same module. A memory represents an instance’s heap. It is contiguous, bounds-checked, and can be exposed to JavaScript as an ArrayBuffer. All WebAssembly memory operations operate on the instance’s memory. Finally, a table holds handles to WebAssembly functions, allowing indirect calls within an instance to target different functions with the same signature (in C++-speak: virtual functions and function pointers). Interestingly, instances can share the same memory, and tables can call directly across instances, which enables dynamic linking.
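
As a rough illustration of how these four concepts surface in the JavaScript API, the sketch below hand-assembles a tiny module that exports a single add function. The byte array and every variable name are assumptions made for this example; in practice the bytes would come from a compiler rather than being typed by hand.

```javascript
// A minimal hand-assembled module exporting add(i32, i32) -> i32.
// These bytes are illustrative only; real modules are produced by a compiler.
const addModuleBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // "\0asm" magic, version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type 0: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                               // function 0 has type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add" = function 0
  0x0a, 0x09, 0x01, 0x07, 0x00,                         // one function body, no locals
  0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,                   // get_local 0, get_local 1, i32.add, end
]);

// Module: compiled code, similar to a program on disk.
const addModule = new WebAssembly.Module(addModuleBytes);

// Instance: one execution of that program. Several can exist at once.
const firstInstance = new WebAssembly.Instance(addModule);
const secondInstance = new WebAssembly.Instance(addModule);

// Memory: a contiguous heap sized in 64 KiB pages, visible as an ArrayBuffer.
const heap = new WebAssembly.Memory({ initial: 1 });
const heapBytes = new Uint8Array(heap.buffer);

// Table: handles to WebAssembly functions, used as indirect call targets.
const functionTable = new WebAssembly.Table({ initial: 1, element: "anyfunc" });
functionTable.set(0, firstInstance.exports.add);

console.log(firstInstance.exports.add(2, 3)); // 5
console.log(functionTable.get(0)(40, 2));     // 42
console.log(heapBytes.length);                // 65536
```

This module declares no memory or table of its own, so the Memory and Table here are created from JavaScript purely to show their shape.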

Because WebAssembly exposes itself as a regular JavaScript object, we were able to reuse some machinery that already existed within WebKit. An interesting example is our reuse of our ECMAScript Module implementation to implement WebAssembly.Instance’s API. For now it’s an implementation detail, invisible to web developers, but integration with ECMAScript Modules is being discussed. For developers who use modules, interaction between JavaScript and WebAssembly would then be totally seamless. Behind the mask of modules, our heroes’ secret identities wouldn’t be revealed.

To allow sharing of modules between Web Workers and to prepare ourselves for future features like threads, we’ve made our internal representation of WebAssembly code thread-safe. This lets developers postMessage a compiled WebAssembly.Module across workers without requiring re-compilation, copying, or any other redundant work. Our implementation of postMessage for modules is simpler than a riddle: sharing a module between workers involves passing a reference to our internal module representation to the other worker. That worker will run the same machine code as the agent that originally produced the module.
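
From the developer's side, sharing a compiled module might look like the sketch below. To keep the example in a single file it posts the module through a MessageChannel instead of to a real Web Worker; a worker's postMessage accepts a WebAssembly.Module in the same way. The postModule helper and the module bytes are hypothetical, and nothing here reflects WebKit's internal representation.

```javascript
// The same illustrative add(i32, i32) -> i32 module, hand-assembled.
const sharedModuleBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // "\0asm" magic, version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type 0: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                               // function 0 has type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add" = function 0
  0x0a, 0x09, 0x01, 0x07, 0x00,                         // one function body, no locals
  0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,                   // get_local 0, get_local 1, i32.add, end
]);

// Hypothetical helper: post a compiled module through a MessagePort and
// resolve with whatever arrives on the other end.
function postModule(module) {
  return new Promise((resolve) => {
    const { port1, port2 } = new MessageChannel();
    port2.onmessage = (event) => {
      port1.close();
      port2.close();
      resolve(event.data);
    };
    port1.postMessage(module);
  });
}

// Compile once, on the sending side.
const compiledOnce = new WebAssembly.Module(sharedModuleBytes);

// The receiver instantiates the module directly; it never sees the bytes.
const roundTrip = postModule(compiledOnce).then((received) => {
  const instance = new WebAssembly.Instance(received);
  return instance.exports.add(20, 22);
});

roundTrip.then((sum) => console.log(sum)); // 42
```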

Utility Belt

WebAssembly directly exposes 32- and 64-bit integers as well as 32- and 64-bit floating point numbers. Its instruction set is equally simple:

i32.add	i64.add	f32.add	f64.add	i32.wrap/i64	i32.load8_s	i32.store8
i32.sub	i64.sub	f32.sub	f64.sub	i32.trunc_s/f32	i32.load8_u	i32.store16
i32.mul	i64.mul	f32.mul	f64.mul	i32.trunc_s/f64	i32.load16_s	i32.store
i32.div_s	i64.div_s	f32.div	f64.div	i32.trunc_u/f32	i32.load16_u	i64.store8
i32.div_u	i64.div_u	f32.abs	f64.abs	i32.trunc_u/f64	i32.load	i64.store16
i32.rem_s	i64.rem_s	f32.neg	f64.neg	i32.reinterpret/f32	i64.load8_s	i64.store32
i32.rem_u	i64.rem_u	f32.copysign	f64.copysign	i64.extend_s/i32	i64.load8_u	i64.store
i32.and	i64.and	f32.ceil	f64.ceil	i64.extend_u/i32	i64.load16_s	f32.store
i32.or	i64.or	f32.floor	f64.floor	i64.trunc_s/f32	i64.load16_u	f64.store
i32.xor	i64.xor	f32.trunc	f64.trunc	i64.trunc_s/f64	i64.load32_s
i32.shl	i64.shl	f32.nearest	f64.nearest	i64.trunc_u/f32	i64.load32_u
i32.shr_u	i64.shr_u			i64.trunc_u/f64	i64.load	call
i32.shr_s	i64.shr_s	f32.sqrt	f64.sqrt	i64.reinterpret/f64	f32.load	call_indirect
i32.rotl	i64.rotl	f32.min	f64.min		f64.load
i32.rotr	i64.rotr	f32.max	f64.max
i32.clz	i64.clz				nop	grow_memory
i32.ctz	i64.ctz				block	current_memory
i32.popcnt	i64.popcnt			f32.demote/f64	loop
i32.eqz	i64.eqz			f32.convert_s/i32	if	get_local
i32.eq	i64.eq			f32.convert_s/i64	else	set_local
i32.ne	i64.ne			f32.convert_u/i32	br	tee_local
i32.lt_s	i64.lt_s			f32.convert_u/i64	br_if
i32.le_s	i64.le_s			f32.reinterpret/i32	br_table	get_global
i32.lt_u	i64.lt_u	f32.eq	f64.eq	f64.promote/f32	return	set_global
i32.le_u	i64.le_u	f32.ne	f64.ne	f64.convert_s/i32	end
i32.gt_s	i64.gt_s	f32.lt	f64.lt	f64.convert_s/i64		i32.const
i32.ge_s	i64.ge_s	f32.le	f64.le	f64.convert_u/i32	drop	i64.const
i32.gt_u	i64.gt_u	f32.gt	f64.gt	f64.convert_u/i64	select	f32.const
i32.ge_u	i64.ge_u	f32.ge	f64.ge	f64.reinterpret/i64	unreachable	f64.const

The instructions are low-level by design, and it is this low-level quality that gives WebAssembly its power. WebAssembly was born as a compilation target, molded by compiler engineers.
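
To give a feel for how low-level these instructions are, here are approximate JavaScript equivalents of a handful of i32 instructions. The helper names are made up for this sketch; the semantics they mimic (two's complement wraparound, shift counts taken modulo 32) come from the WebAssembly specification.

```javascript
// 32-bit integers wrap around on overflow.
const i32Add = (a, b) => (a + b) | 0;
const i32Mul = (a, b) => Math.imul(a, b);

// Shift and rotate counts are taken modulo 32, as JavaScript's operators do.
const i32ShrS = (a, n) => a >> n;         // sign-propagating shift
const i32ShrU = (a, n) => (a >>> n) | 0;  // zero-filling shift, viewed as signed
const i32Rotl = (a, n) => (a << n) | (a >>> (32 - n));

// Count leading zero bits.
const i32Clz = (a) => Math.clz32(a);

console.log(i32Add(0x7fffffff, 1)); // -2147483648
console.log(i32ShrU(-1, 28));       // 15
console.log(i32Clz(1));             // 31
```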

OMG! BBQ!

WebKit’s WebAssembly implementation, like our JavaScript implementation, uses a tiering system to balance startup costs with throughput. Currently, there are two tiers to the engine: the Build Bytecode Quickly (BBQ) tier and the Optimized Machine-code Generator (OMG) tier. Both rely on the B3 JIT as their low-level optimizer.

WebAssembly modules can easily contain tons of code, some of which is executed only once or infrequently. This is why we opted to use two tiers: one that generates decent code quickly, and one that generates optimized code only when the engine thinks the code is hot enough to warrant it. BBQ compiles code about 4× as fast as OMG, but produces code that executes roughly 2× as slow as OMG’s. We compile functions with OMG on a background thread. When OMG compilation finishes, we pause the executing WebAssembly threads and hot-patch the OMG compilation into the module.
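
A back-of-the-envelope model shows why the trade-off favors two tiers. The function below is a deliberately simplified assumption, not WebKit's heuristic: it treats total cost as compile time plus calls times time per call, and it ignores the fact that OMG compiles in the background while BBQ code keeps running. The name breakEvenCalls and the time units are invented for the example.

```javascript
// Simplified cost model (an assumption for illustration):
//   total time = compile time + calls * time per call
function breakEvenCalls(omgCompileTime, omgCallTime) {
  const bbqCompileTime = omgCompileTime / 4; // BBQ compiles about 4x as fast
  const bbqCallTime = omgCallTime * 2;       // BBQ code runs about 2x as slow
  // Solve bbqCompileTime + n * bbqCallTime = omgCompileTime + n * omgCallTime for n.
  return (omgCompileTime - bbqCompileTime) / (bbqCallTime - omgCallTime);
}

// If OMG needs 4000 time units to compile a function whose optimized body
// takes 1 unit per call, BBQ alone stays ahead for the first 3000 calls.
console.log(breakEvenCalls(4000, 1)); // 3000
```

Under this model, a function that runs only a few times never repays the cost of optimizing it, which is the case the quick tier is meant for.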

WebAssembly being a low-level format—compared to JavaScript being a dynamic language—means that things aren’t as fickle when compiling WebAssembly. For example, WebAssembly statically tells us the types of all values within a function and gives us the signatures of all functions ahead of time.

BBQ 🔥

In order to produce executable code as soon as possible, the BBQ tier omits many of the optimizations possible in the B3 compiler. The BBQ tier also uses a linear scan combined register / stack allocation algorithm, which allocates registers about 11× faster than the graph coloring algorithm that B3 usually uses. Avoiding expensive optimization allows WebKit to produce BBQ fast, so that BBQ may be consumed as soon as possible.
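
For readers unfamiliar with linear scan, the sketch below shows the textbook form of the algorithm: one pass over live intervals sorted by start, handing out registers and spilling to the stack when none are free. It is not WebKit's allocator; the interval format, the linearScan name, and the "stack" marker are all assumptions made for this example.

```javascript
// Textbook linear scan over live intervals of the form { name, start, end }.
// Returns a Map from value name to a register index or the string "stack".
function linearScan(intervals, registerCount) {
  const assignment = new Map();
  const freeRegisters = [];
  for (let r = registerCount - 1; r >= 0; r--) freeRegisters.push(r);
  let active = []; // intervals currently holding a register, sorted by end

  const byStart = [...intervals].sort((a, b) => a.start - b.start);
  for (const current of byStart) {
    // Expire intervals that ended before this one starts and free their registers.
    active = active.filter((old) => {
      if (old.end >= current.start) return true;
      freeRegisters.push(assignment.get(old.name));
      return false;
    });

    if (freeRegisters.length > 0) {
      assignment.set(current.name, freeRegisters.pop());
      active.push(current);
    } else {
      // No register is free: spill whichever live interval ends last.
      const last = active[active.length - 1];
      if (last !== undefined && last.end > current.end) {
        assignment.set(current.name, assignment.get(last.name));
        assignment.set(last.name, "stack");
        active[active.length - 1] = current;
      } else {
        assignment.set(current.name, "stack");
      }
    }
    active.sort((a, b) => a.end - b.end);
  }
  return assignment;
}

const allocation = linearScan(
  [
    { name: "a", start: 0, end: 10 },
    { name: "b", start: 1, end: 3 },
    { name: "c", start: 2, end: 8 },
    { name: "d", start: 4, end: 6 },
  ],
  2,
);
console.log([...allocation]);
```

The appeal is that each interval is visited once, whereas graph coloring has to build and simplify an interference graph.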

OMG 😲

When a function has been executed enough times, our WebAssembly runtime decides to optimize that function. Since WebAssembly does not require any type speculations, we only use tiering to conserve compile time. BBQ code contains only the profiling needed to detect when code has executed many times.
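
The idea of counting executions and then switching tiers can be sketched in a few lines. This is a toy model under stated assumptions: the threshold is invented, the switch happens instantly rather than after a background compile, and plain JavaScript functions stand in for the two tiers' machine code.

```javascript
// Hypothetical threshold; the real value is an engine-internal heuristic.
const TIER_UP_THRESHOLD = 3;

function makeTieredFunction(baselineCode, optimizedCode) {
  let executionCount = 0;
  let tier = "BBQ";
  const call = (...args) => {
    if (tier === "BBQ") {
      // The only profiling the baseline code carries is an execution counter.
      executionCount += 1;
      if (executionCount >= TIER_UP_THRESHOLD) tier = "OMG";
      return baselineCode(...args);
    }
    return optimizedCode(...args);
  };
  call.currentTier = () => tier;
  return call;
}

// Both tiers compute the same result; only their speed would differ.
const square = makeTieredFunction((x) => x * x, (x) => x * x);
square(2);
square(3);
console.log(square.currentTier()); // "BBQ"
square(4);
console.log(square.currentTier()); // "OMG"
```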

We share modules, along with all of their runtime state (like tiering-up from BBQ to OMG), between workers. Since we hot-patch the BBQ code, which might be executing on any number of threads, we need to be sure that each call-site can be concurrently updated to point to the OMG code. In order to avoid checking for new code on each call to a function, like JavaScript does, we track each call-site to every function, both in assembly and indirectly. Whenever the OMG version of a function finishes compiling, we update each of its call-sites to point to the new code.
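
Conceptually, call-site tracking looks something like the sketch below: every call site is a patchable slot, the callee remembers all of its slots, and installing new code rewrites them in one sweep. The PatchableFunction class is hypothetical and single-threaded; the real implementation patches machine code and has to do so safely while other threads are running it.

```javascript
// Each call site is a mutable slot holding the current entry point of its callee.
class PatchableFunction {
  constructor(baselineCode) {
    this.entryPoint = baselineCode;
    this.callSites = [];
  }

  // Record a new call site. Callers invoke site.target directly and never
  // check whether newer code exists.
  linkCallSite() {
    const site = { target: this.entryPoint };
    this.callSites.push(site);
    return site;
  }

  // When optimized code is ready, repoint every recorded call site at it.
  installOptimized(optimizedCode) {
    this.entryPoint = optimizedCode;
    for (const site of this.callSites) site.target = optimizedCode;
  }
}

const callee = new PatchableFunction(() => "BBQ");
const siteA = callee.linkCallSite();
const siteB = callee.linkCallSite();
console.log(siteA.target(), siteB.target()); // BBQ BBQ

callee.installOptimized(() => "OMG");
console.log(siteA.target(), siteB.target()); // OMG OMG
```

The cost moves from every call (a check) to the rare tier-up event (a sweep over the recorded sites).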

Mem-SIGNAL 🦇

One of the most important optimizations in WebAssembly is reducing the overhead on memory accesses while preserving security guarantees. WebAssembly specifies that all memory accesses are performed on a single 32-bit linear memory. We keep these accesses cheap by reserving slightly more than 4GiB of virtual address space. This virtual reservation doesn’t occupy physical memory unless accessed. We use the hardware’s page protection mechanism to mark all but the lower pages as non-readable and non-writable. All loads and stores from this 32-bit address space can then be performed normally, without explicit bounds checking instructions. If a load or store would be out of bounds, it triggers a hardware fault that ultimately results in a POSIX SIGSEGV or a Mach EXC_BAD_ACCESS exception. Our custom signal / Mach exception handlers then check that the fault originated from a WebAssembly memory operation. If so, we set the instruction pointer to a special code stub that performs a WebAssembly trap and materializes a WebAssembly.RuntimeError in JavaScript. This optimization means that a WebAssembly memory operation typically results in a single load or store instruction. Overall, we measured a 15–20% speedup from this optimization on various WebAssembly benchmarks.
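
The mechanism itself is invisible from JavaScript, but its result can be observed. The sketch below hand-assembles a module with one 64 KiB page of memory and an exported load function, then reads inside and outside that page. The bytes, the tryLoad helper, and the "trap" marker are assumptions for this example; whether a given engine reaches the trap through a signal handler or an explicit check is an implementation detail the API does not expose.

```javascript
// A hand-assembled module with one page of memory and load(i32) -> i32.
// These bytes are illustrative only; real modules are produced by a compiler.
const loadModuleBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // "\0asm" magic, version 1
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f,             // type 0: (i32) -> i32
  0x03, 0x02, 0x01, 0x00,                                     // function 0 has type 0
  0x05, 0x03, 0x01, 0x00, 0x01,                               // one memory, initial size 1 page
  0x07, 0x08, 0x01, 0x04, 0x6c, 0x6f, 0x61, 0x64, 0x00, 0x00, // export "load" = function 0
  0x0a, 0x09, 0x01, 0x07, 0x00,                               // one function body, no locals
  0x20, 0x00, 0x28, 0x02, 0x00, 0x0b,                         // get_local 0, i32.load, end
]);

const loader = new WebAssembly.Instance(new WebAssembly.Module(loadModuleBytes));

// Read four bytes at the given address, reporting a trap instead of throwing.
function tryLoad(address) {
  try {
    return loader.exports.load(address);
  } catch (error) {
    if (error instanceof WebAssembly.RuntimeError) return "trap";
    throw error;
  }
}

console.log(tryLoad(0));     // 0: fresh memory is zero-filled
console.log(tryLoad(65532)); // 0: the last four bytes of the page
console.log(tryLoad(65533)); // "trap": the access runs past the end
console.log(tryLoad(65536)); // "trap": entirely out of bounds
```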

💥 KAPOW! 💥

Get in touch with us by filing a bug, or contacting JF, Keith, and Saam on Twitter.