`coldbrew` is a (WIP) tracing JIT compiler for the Java Virtual Machine, with support for primitive numeric types (`int`, `long`, `float`, `double`), and serves as a demo project for how JIT compilers work in general.
`coldbrew` is inspired primarily by TigerShrimp[1] and some ideas from Higgs[2]; the TigerShrimp C++ implementation[3] is very readable and was of huge help in debugging some issues along the way. Other implementations I've found useful are LuaJIT 2.0 and Mike Pall's email about the LuaJIT internals, which you can find in the mailing list[4].
While I tried to stay as close to the TigerShrimp implementation as possible, there are some changes in the overall structure since we are using Rust.
I was planning to start with ARM64[5] and then do x86, but I changed machines and basically refactored the code to start with x86 support.
I was originally planning to use the C++ implementation as a baseline to test against, but I didn't have much success building it.
`coldbrew` bundles a traditional bytecode interpreter with a runtime for the JVM as per the Java SE7 specification described in the link below[6]. During execution the bytecode is profiled and execution traces are recorded.
The trace contains all the information needed to compile the bytecode to native code, such as the entry and exit codes and the bytecode of the core loop. Once a trace is ready we pipeline it to the JIT cache for compilation and caching; when we reach that code path again, execution leaves the VM and runs the compiled native trace before returning control to the VM.
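As a rough mental model, a recorded trace can be pictured as a small record like the one below; the type and field names are illustrative, not coldbrew's actual definitions:

```rust
// Illustrative shape of a recorded trace; not coldbrew's actual types.
type Instruction = (u8, Vec<u8>); // opcode plus raw operands, simplified

struct Trace {
    entry_pc: usize,        // program counter where the loop is entered
    exit_pc: usize,         // program counter the VM resumes at on exit
    body: Vec<Instruction>, // the recorded bytecode of the core loop
}
```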
The tracelet approach consists of compiling pseudo basic blocks. I use the term pseudo because when we compile a trace we're essentially compiling JVM bytecodes, and we don't have an intermediate representation to truly know if it's a basic block. But looking empirically at the generated traces, we can see that there are basic blocks (sequences of bytecodes without branches) followed by jumps.
This approach is suitable for starting out the JIT because, while it's expensive to exit the JIT at each jump, we get small chunks of assembly we can introspect and understand.
To make `coldbrew` truly tracing we will need to handle side exits and trace stitching. I've had a lot of misconceptions about handling exits, especially since each tracelet can be a succession of 6 bytecode instructions, each separated by one or two instructions that we can't JIT compile. This is especially complicated to do correctly because, unlike a method-based JIT, you can't increment the program counter by the number of bytecodes you just compiled, since they are interleaved with non-compiled instructions.
The HotpathVM paper[7] has a nice explanation of this problem.
To identify hot paths we use heuristics that target loop headers, which we can identify when we encounter a backwards branch instruction. Once such a branch is identified, we record the program counter we branch to as the start of the loop.
But tracking backwards branches alone is not sufficient; we also need to compute their execution frequency to decide whether they are hot. The invocation frequency threshold (currently set to 2) triggers the start of recording.
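To make the heuristic concrete, here is a minimal sketch of backwards-branch profiling; `LoopProfiler` and its methods are hypothetical names, not coldbrew's actual API:

```rust
use std::collections::HashMap;

/// Hypothetical profiler: counts how often each backwards-branch target is hit.
struct LoopProfiler {
    counts: HashMap<usize, u32>, // loop-header pc -> execution frequency
    threshold: u32,              // set to 2, matching the text above
}

impl LoopProfiler {
    fn new() -> Self {
        Self { counts: HashMap::new(), threshold: 2 }
    }

    /// Called on every branch; returns true when recording should start.
    fn record_branch(&mut self, pc: usize, target: usize) -> bool {
        if target >= pc {
            return false; // only backwards branches mark loop headers
        }
        let count = self.counts.entry(target).or_insert(0);
        *count += 1;
        *count >= self.threshold
    }
}
```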
An example of a trace would be a sequence of bytecode like this:

```
iload_1
sipush 1000
if_icmpge B
iinc 2, 1
iinc 1, 1
goto A
```
But the above trace contains two exits, the first to label `B` and the second to label `A`, so the tracelets would be essentially what is in between:
```
iload_1
sipush 1000
iinc 2, 1
iinc 1, 1
```
We could take this one step further and do something similar to speculative execution, where at the branch `if_icmpge B` we record a hint as to whether the branch should be taken or not. If the branch is not taken we can stitch both tracelets and use the hint as a guard instruction. This is implemented in `flip_branch`, where we flip the branch instruction during recording and use it as an implicit guard at runtime. If a guard instruction fails at runtime, control is transferred back to the virtual machine interpreter, which resumes execution. This is where the need for side exits comes up: when we transfer control back to the VM, any state mutations to local variables need to be written back to the VM's current frame.
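As an illustration of the flipping itself, here is a simplified sketch over the JVM's integer-comparison branches; the enum is illustrative and this is not coldbrew's actual `flip_branch`:

```rust
/// Simplified sketch of branch flipping: replace a conditional branch with its
/// logical inverse so the fall-through path becomes the recorded trace and the
/// flipped condition acts as a guard. Mnemonics mirror the JVM comparison ops.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Branch {
    IfIcmpEq,
    IfIcmpNe,
    IfIcmpLt,
    IfIcmpGe,
    IfIcmpGt,
    IfIcmpLe,
}

fn flip_branch(b: Branch) -> Branch {
    match b {
        Branch::IfIcmpEq => Branch::IfIcmpNe,
        Branch::IfIcmpNe => Branch::IfIcmpEq,
        Branch::IfIcmpLt => Branch::IfIcmpGe,
        Branch::IfIcmpGe => Branch::IfIcmpLt, // e.g. the `if_icmpge B` above
        Branch::IfIcmpGt => Branch::IfIcmpLe,
        Branch::IfIcmpLe => Branch::IfIcmpGt,
    }
}
```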
So the above trace ends up as such:

```
1: iload_1
2: sipush 1000
3: if_icmpge .exit
4: iinc 1, 1
5: iinc 0, 1
6: goto 3
.exit:
trap
```
Where `trap` is a pseudo-instruction that signals the JIT to transfer control back to the virtual machine. We also do a label swap where we use relative in-trace offsets; this is essential since when we compile `goto` at line 6 we need to jump to a label that exists within the trace. One way to solve this is to create a local map of `{pc, label}` where we compile `goto 3` as `jmp .label_3`, with `.label_3` created during the initial recording. Another way to solve this is to generate the code in reverse, which lets you encounter the branch, in this case `goto 3`, before its actual target and eliminates the need for the stub technique we mentioned, which requires keeping track of all inner branches.
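Here is a rough sketch of the `{pc, label}` approach using dynasm-rs dynamic labels; the structure is simplified and not coldbrew's actual compilation code:

```rust
use std::collections::HashMap;

use dynasmrt::{dynasm, DynamicLabel, DynasmApi, DynasmLabelApi};

// Sketch of in-trace branch resolution: every bytecode pc that is a branch
// target gets a dynamic label, so `goto 3` compiles to `jmp .label_3`.
fn compile_trace_skeleton() {
    let mut ops = dynasmrt::x64::Assembler::new().unwrap();
    let mut labels: HashMap<usize, DynamicLabel> = HashMap::new();

    // Pin a label at pc 3 when we emit the code for that instruction.
    let label_3 = *labels.entry(3).or_insert_with(|| ops.new_dynamic_label());
    dynasm!(ops
        ; .arch x64
        ; =>label_3
        // ... native code for the instructions at pc 3..6 would go here ...
    );

    // Compiling `goto 3` becomes a jump to the label pinned at pc 3.
    dynasm!(ops
        ; .arch x64
        ; jmp =>label_3
    );
}
```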
Recording bytecode aborts whenever we encounter a non-recoverable state; this includes native method invocations, exceptions, and recursive function calls.
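As a rough illustration, the abort condition can be keyed on the opcode being recorded; the opcode ranges below are the standard JVM ones, but the predicate is a hypothetical simplification (detecting recursion, for instance, requires call-stack tracking rather than a pure opcode check):

```rust
/// Hypothetical recording-abort predicate keyed on JVM opcodes.
fn aborts_recording(opcode: u8) -> bool {
    match opcode {
        0xb6..=0xba => true, // invokevirtual..invokedynamic: may hit native or recursive methods
        0xbf => true,        // athrow: exception paths are not recorded
        _ => false,
    }
}
```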
When it comes to executing the trace, we assemble the native trace using dynasm and record it as a pointer to a function with the following signature:

```
fn execute(*mut i32, *mut u8) -> i32
```

The first argument is a pointer to the local variables, which we cast to `i32`; this is done mainly because we use `enum Value` to represent dynamic values and we don't control the memory layout of the enums (plus it's simpler).
The second argument is a pointer to the existing traces; this is used to keep executing if the exit program counter points to another trace that we have already prepared.
Once execution is complete we overwrite the current VM frame's local variables array with the ones we passed to `execute`, which you will notice is marked as mutable. This is how we "save" mutated local state when exiting the JIT.
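Putting the calling convention together, here is a hedged sketch of invoking a finalized trace; only the `execute` signature comes from the text above, `run_trace` and its details are assumptions:

```rust
use dynasmrt::{AssemblyOffset, ExecutableBuffer};

// The signature from the text, spelled as a Rust function-pointer type.
type TraceFn = unsafe extern "C" fn(*mut i32, *mut u8) -> i32;

/// Hypothetical helper: run a compiled trace against the frame's locals.
fn run_trace(buf: &ExecutableBuffer, entry: AssemblyOffset, locals: &mut [i32]) -> i32 {
    let f: TraceFn = unsafe { std::mem::transmute(buf.ptr(entry)) };
    // The locals are mutated in place by the native code, so copying them
    // back into the VM frame after the call "saves" the JIT-side state.
    unsafe { f(locals.as_mut_ptr(), std::ptr::null_mut()) }
}
```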
It is possible to take this approach even further by incorporating analysis within the trace recording phase. Essentially, once you've recorded a trace you could transform it to SSA form and run any kind of analysis or optimization, such as dead-code elimination, common subexpression elimination, constant folding, inlining, and much more.
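For a flavor of what such a pass could look like, here is a toy constant-folding sketch over a hypothetical SSA-style IR; none of these types exist in coldbrew:

```rust
/// Toy three-address IR in SSA style: each instruction's operands refer to
/// the results of earlier instructions by index.
#[derive(Clone, Copy)]
enum Op {
    Const(i32),
    Add(usize, usize),
}

/// Fold an `Add` of two known constants into a single `Const`.
fn fold_constants(ir: &mut [Op]) {
    for i in 0..ir.len() {
        if let Op::Add(a, b) = ir[i] {
            if let (Op::Const(x), Op::Const(y)) = (ir[a], ir[b]) {
                ir[i] = Op::Const(x + y);
            }
        }
    }
}
```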
The following flamegraph shows the initial implementation's performance on the integration test suite.
This was recorded without debug prints and shows that we spend a lot of time in `fetch`; this was due to using a `HashMap` for recording code sections in the JVM class file.
The first thing that came to mind was to switch to using a `Vec`; this improved perf quite a bit and we went from spending 37% of the time in `fetch` to about 4%. The biggest bottleneck is now the `JitCache::compile` function, which was quite expected since we try to compile everything. Once we change how often we compile, we can build and stitch larger traces.
I would like to thank the authors of the TigerShrimp work for providing their implementation. The thesis is an excellent introduction to tracing JITs and is a must-read for anyone who wishes to understand the overall architecture and details of tracing JIT interpreters.