Update RyuJIT documentation (dotnet#755)
* Update RyuJIT documentation

1. Convert links to reference new repo
2. Update various mostly stylistic things in the RyuJIT overview
3. Add more high-level strategy/steps statements to the RyuJIT porting guide

* Update for code review comments
BruceForstall authored Dec 12, 2019
1 parent f58a38c commit 4a7a2a8
Showing 2 changed files with 522 additions and 238 deletions.
162 changes: 123 additions & 39 deletions docs/design/coreclr/botr/porting-ryujit.md
# RyuJIT: Porting to different platforms

First, understand the JIT architecture by reading [RyuJIT overview](ryujit-overview.md).

# What is a Platform?

* Target instruction set
* Target pointer size
* Target operating system
* Target calling convention and ABI (Application Binary Interface)
* Runtime data structures (not really covered here)
* All targets use the same GC encoding scheme and APIs, except for Windows x86, which uses JIT32_GCENCODER.
* Debug information (mostly the same for all targets)
* EH (exception handling) information (not really covered here)

One advantage of the CLR is that the VM (mostly) hides the (non-ABI) OS differences.

# The Very High Level View

The following components need to be updated, or target-specific versions created, for a new platform.

* The basics
* target.h
* Instruction set architecture:
* registerXXX.h
* emitXXX.h, emitfmtXXX.h
* instrsXXX.h, emitXXX.cpp and targetXXX.cpp
* lowerXXX.cpp
* lsraXXX.cpp
* codegenXXX.cpp and simdcodegenXXX.cpp
* unwindXXX.cpp
* Calling Convention and ABI: all over the place
* 32 vs. 64 bits
* Also all over the place. Some pointer size-specific data is centralized in target.h, but probably not 100%.

# Porting stages and steps

There are several steps to follow to port the JIT (some of which can be done in parallel), described below.

## Initial bring-up

* Create the new platform-specific files
* Create the platform-specific build instructions (in CMakeLists.txt). This will probably require
new platform-specific build instructions both at the root level and at the JIT level of the source tree.
* Focus on MinOpts; disable the optimization phases, or always test with `COMPlus_JITMinOpts=1`.
* Disable optional features, such as:
* `FEATURE_EH` -- if 0, all exception handling blocks are removed. Of course, tests with exception handling
that depend on exceptions being thrown and caught won't run correctly.
* `FEATURE_STRUCTPROMOTE`
* `FEATURE_FASTTAILCALL`
* `FEATURE_TAILCALL_OPT`
* `FEATURE_SIMD`
* Build the new JIT as an altjit. In this mode, a "base" JIT is invoked to compile all functions except
the one(s) specified by the `COMPlus_AltJit` variable. For example, setting `COMPlus_AltJit=Add` and running
a test will use the "base" JIT (say, the Windows x64 targeting JIT) to compile all functions *except*
`Add`, which will first be compiled by the new altjit and, if that fails, fall back to the "base" JIT. In this
way, only very limited JIT functionality needs to work, as the "base" JIT takes care of most functions.
* Implement the basic instruction encodings. Test them using a method like `CodeGen::genArm64EmitterUnitTests()`.
* Implement the bare minimum to get the compiler building and generating code for very simple operations, like addition.
* Focus on the CodeGenBringUpTests (src\coreclr\tests\src\JIT\CodeGenBringUpTests), starting with the simple ones.
These are designed such that for a test `XXX.cs`, there is a single interesting function named `XXX` to compile
(the source file name matches the name of the interesting function, which keeps the scripts that invoke these
tests very simple). Set `COMPlus_AltJit=XXX` so the new JIT only attempts to
compile that one function.
* Use `COMPlus_JitDisasm` to see the generated code for functions, even if the code isn't run.
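The altjit fallback flow described above can be sketched as follows. This is a minimal illustration, not the actual VM/JIT interface; the names `compileWith`, `Compiler`, and the parameters are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the altjit fallback: the VM routes only the method
// named by COMPlus_AltJit to the new (alt) JIT; every other method -- and any
// method the altjit fails to compile -- is handled by the "base" JIT.
enum class Compiler { AltJit, BaseJit };

// Returns which compiler ends up producing code for `method`, given the
// AltJit filter and whether the new JIT can handle the method yet.
Compiler compileWith(const std::string& method,
                     const std::string& altJitFilter,
                     bool altJitSucceeds)
{
    if (method == altJitFilter && altJitSucceeds)
        return Compiler::AltJit;  // the new JIT compiled it
    return Compiler::BaseJit;     // filtered out, or fell back on failure
}
```

With `COMPlus_AltJit=Add`, only `Add` is ever routed through the new JIT, so early bring-up work can focus on one function at a time.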

## Expand test coverage

* Get more and more tests to run successfully:
* Run more of the `JIT` directory of tests
* Run all of the Pri-0 "innerloop" tests
* Run all of the Pri-1 "outerloop" tests
* It is helpful to collect data on asserts generated by the JIT across the entire test base, and fix the asserts in
order of frequency. That is, fix the most frequently occurring asserts first.
* Track the number of asserts, and number of tests with/without asserts, to help determine progress.
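The assert-triage idea above can be sketched as a simple frequency tally. This is an illustrative helper under assumed inputs (lines of identical assert text scraped from test logs), not tooling that exists in the repo.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Sketch of triaging JIT asserts by frequency: count identical assert
// messages collected from test logs, then sort so the most common -- and
// therefore highest-payoff -- asserts come first.
std::vector<std::pair<std::string, int>>
rankAsserts(const std::vector<std::string>& assertLines)
{
    std::map<std::string, int> counts;
    for (const std::string& line : assertLines)
        ++counts[line];

    // Most frequent first.
    std::vector<std::pair<std::string, int>> ranked(counts.begin(), counts.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    return ranked;
}
```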

## Bring the optimizer phases on-line

* Run tests with and without `COMPlus_JITMinOpts=1`.
* It probably makes sense to set `COMPlus_TieredCompilation=0` (or disable it for the platform entirely) until much later.

## Improve quality

* When the tests pass with the basic modes, start running with `JitStress` and `JitStressRegs` stress modes.
* Bring `GCStress` on-line. This also requires VM work.
* Work on `COMPlus_GCStress=4` quality. When crossgen/ngen is brought on-line, test with `COMPlus_GCStress=8`
and `COMPlus_GCStress=C` as well.

## Work on performance

* Determine a strategy for measuring and improving performance, both throughput (compile time) and generated code
quality (CQ).

## Work on platform parity

* Implement features that were intentionally disabled, or for which implementation was delayed.
* Implement SIMD (`Vector<T>`) and hardware intrinsics support.

# Front-end changes

* Calling Convention
* Struct args and returns seem to be the most complex differences
* Importer and morph are highly aware of these
* E.g. `fgMorphArgs()`, `fgFixupStructReturn()`, `fgMorphCall()`, `fgPromoteStructs()` and the various struct assignment morphing methods
* HFAs on ARM
* Tail calls are target-dependent, but probably should be less so
* Intrinsics: each platform recognizes different methods as intrinsics (e.g. `Sin` only for x86, `Round` everywhere BUT amd64)
* Target-specific morphs such as for mul, mod and div
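One classic example of such a morph is rewriting unsigned division by a power of two as a shift on targets where divides are expensive. The sketch below shows the transformation on plain values; the real `fgMorph*` methods perform the equivalent rewrite on `GenTree` nodes, and the helper names here are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Is d a nonzero power of two? (A power of two has exactly one bit set.)
bool isPow2(uint32_t d) { return d != 0 && (d & (d - 1)) == 0; }

// Simplified illustration of a target-dependent morph: unsigned division by
// a power of two becomes a right shift; anything else falls back to a real
// divide instruction.
uint32_t morphedUDiv(uint32_t x, uint32_t d)
{
    if (isPow2(d))
    {
        // Compute log2(d), then shift instead of dividing.
        int shift = 0;
        while ((1u << shift) < d) ++shift;
        return x >> shift;
    }
    return x / d; // non-power-of-two (and assumed nonzero) divisor
}
```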

# Backend Changes

* Lowering: fully expose control flow and register requirements
* Code Generation: traverse blocks in layout order, generating code (InstrDescs) based on register assignments on nodes
* Then, generate prolog & epilog, as well as GC, EH and scope tables
* Code sequences for prologs & epilogs
* Allocation & layout of frame

# Target ISA "Configuration"

* Conditional compilation (set in jit.h, based on an incoming define, e.g. `#ifdef X86`)
```C++
_TARGET_64_BIT_ (32 bit target is just ! _TARGET_64BIT_)
_TARGET_AMD64_, _TARGET_X86_, _TARGET_ARM64_, _TARGET_ARM_
```
* Target.h
* InstrsXXX.h
# Instruction Encoding
* The `insGroup` and `instrDesc` data structures are used for encoding
* `instrDesc` is initialized with the opcode bits, and has fields for immediates and register numbers.
* `instrDesc`s are collected into `insGroup` groups
* A label may only occur at the beginning of a group
* The emitter is called to:
* Create new instructions (`instrDesc`s), during CodeGen
* Emit the bits from the `instrDesc`s after CodeGen is complete
* Update Gcinfo (live GC vars & safe points)
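The shape of these two emitter data structures can be sketched as follows. The field and type names here are hypothetical stand-ins (the real `instrDesc` is far more compact and target-dependent); the sketch only shows the containment relationship: groups hold instructions, and labels land only at group boundaries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One instruction to be emitted: opcode bits captured at creation time,
// plus operand fields (a register number and an immediate, here).
struct InstrDescSketch
{
    uint32_t opcodeBits;
    int      reg;
    int64_t  imm;
};

// A group of consecutive instructions. A label may only refer to the
// beginning of a group, so branch targets are always group starts.
struct InsGroupSketch
{
    std::string                  label;
    std::vector<InstrDescSketch> instrs;
};

// Total instruction count across all groups (e.g. for emitting the bits
// after CodeGen is complete).
std::size_t totalInstrs(const std::vector<InsGroupSketch>& groups)
{
    std::size_t n = 0;
    for (const InsGroupSketch& g : groups)
        n += g.instrs.size();
    return n;
}
```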
# Adding Encodings
* The instruction encodings are captured in instrsXXX.h. These are the opcode bits for each instruction
* The structure of each instruction set's encoding is target-dependent
* An "instruction" is just the representation of the opcode
* An instance of `instrDesc` represents the instruction to be emitted
* For each "type" of instruction, emit methods need to be implemented. These follow a pattern but a target may have unique ones, e.g.
```C++
emitter::emitInsMov(instruction ins, emitAttr attr, GenTree* node)
emitter::emitIns_R_I(instruction ins, emitAttr attr, regNumber reg, ssize_t val)
emitter::emitInsTernary(instruction ins, emitAttr attr, GenTree* dst, GenTree* src1, GenTree* src2) (currently Arm64 only)
```

# Lowering

* Lowering ensures that all register requirements are exposed for the register allocator
* Use count, def count, "internal" reg count, and any special register requirements
* Does half the work of code generation, since all computation is made explicit
* Sets register requirements
* sometimes changes the register requirements of children (which have already been traversed)
* Sets the block order and node locations for LSRA
* `LinearScan::startBlockSequence()` and `LinearScan::moveToNextBlock()`
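The per-node register requirements that Lowering exposes can be sketched like this. The structure and names are hypothetical (the real information lives in `RefPosition`s and related LSRA data structures); the sketch shows only the three quantities the text names: use count, def count, and internal register count.

```cpp
#include <cassert>

// What Lowering records for each node, for the register allocator's benefit.
struct NodeRegRequirements
{
    int useCount;      // registers consumed from operands
    int defCount;      // registers produced (0 for contained nodes)
    int internalCount; // scratch ("internal") registers the code sequence needs
};

// A contained node is folded into its parent's code sequence (e.g. an
// immediate folded into an add), so it uses and defines no registers itself.
NodeRegRequirements lowerNode(bool isContained, int operandCount, int internalRegs)
{
    if (isContained)
        return {0, 0, 0};
    return {operandCount, 1, internalRegs};
}
```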

# Register Allocation

* Register allocation is largely target-independent
* The second phase of Lowering does nearly all the target-dependent work
* Register candidates are determined in the front-end
* Local variables or temps, or fields of local variables or temps
* Not address-taken, plus a few other restrictions
* Sorted by `lvaSortByRefCount()`, and determined by `lvIsRegCandidate()`
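A minimal sketch of the candidate test, using hypothetical field names in place of the lclvar flags the real `lvIsRegCandidate()` consults (the real check has more restrictions than shown here):

```cpp
#include <cassert>

// Simplified local-variable descriptor: only the two properties the
// sketch consults.
struct LocalVarSketch
{
    bool addressTaken; // &local observed somewhere -- must live on the stack
    int  refCount;     // what lvaSortByRefCount() sorts on
};

// A local can be enregistered only if its address is never taken (among
// other restrictions), and an unreferenced local needs no register at all.
bool isRegCandidate(const LocalVarSketch& lcl)
{
    return !lcl.addressTaken && lcl.refCount > 0;
}
```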

# Addressing Modes

* The code to find and capture addressing modes is particularly poorly abstracted
* `genCreateAddrMode()`, in CodeGenCommon.cpp traverses the tree looking for an addressing mode, then captures its constituent elements (base, index, scale & offset) in "out parameters"
* It never generates code, and is only used by `gtSetEvalOrder`, and by Lowering
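The "out parameters" can be pictured as the four pieces of a scaled addressing mode. This sketch just evaluates `base + index*scale + offset` directly; the real `genCreateAddrMode()` instead recognizes that shape in a tree and returns its constituents, and the struct here is a hypothetical stand-in.

```cpp
#include <cassert>
#include <cstdint>

// The constituent elements of an addressing mode, as captured by the
// "out parameters" of genCreateAddrMode().
struct AddrMode
{
    int64_t base;
    int64_t index;
    int     scale;  // typically 1, 2, 4, or 8
    int64_t offset;
};

// The address such a mode computes: base + index*scale + offset.
int64_t effectiveAddress(const AddrMode& am)
{
    return am.base + am.index * am.scale + am.offset;
}
```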

# Code Generation

* For the most part, the code generation method structure is the same for all architectures
* Most code generation methods start with "gen"
* Theoretically, CodeGenCommon.cpp contains code "mostly" common to all targets (this factoring is imperfect)
* Method prolog, epilog,
* `genCodeForBBList()`
* Walks the trees in execution order, calling `genCodeForTreeNode()`, which needs to handle all nodes that are not "contained"
* Generates control flow code (branches, EH) for the block
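The traversal structure above can be sketched as follows. Names and types are illustrative, not the real CodeGen interface; the point is the shape: blocks in layout order, nodes in execution order, with contained nodes skipped because their parent emits them as part of its own code sequence.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A node is just an opcode name plus a "contained" flag for this sketch.
struct NodeSketch
{
    std::string op;
    bool        contained;
};

using BlockSketch = std::vector<NodeSketch>;

// Walk blocks in layout order and nodes in execution order, emitting an
// opcode for every non-contained node (genCodeForTreeNode's job); the real
// code also emits each block's control-flow instructions afterward.
std::vector<std::string> genCodeForBlocks(const std::vector<BlockSketch>& blocks)
{
    std::vector<std::string> emitted;
    for (const BlockSketch& block : blocks)
    {
        for (const NodeSketch& node : block)
        {
            if (node.contained)
                continue; // folded into the parent node's code sequence
            emitted.push_back(node.op);
        }
    }
    return emitted;
}
```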
