-
Notifications
You must be signed in to change notification settings - Fork 68
Architecture: Nanoscope ROM
Fork of AOSP that enables Nanoscope's extremely low-overhead method tracing.
- Overview
- API
- Trace File Format
- Trace Data Buffer
- Writing to the Buffer
- Flushing the Buffer
- Generic Timer Register
- Trace Extensions
- Building from Sources
Trace every Java method with minimal instrumentation overhead.
- Upon entering a method, write a method pointer and timestamp to our trace data buffer to represent a
push
to the callstack. - On method exit, write a
nullptr
and timestamp to represent apop
. - When we stop tracing, process and flush the trace data to disk.
nanoscope-propertywatcher.h
The System Property API is useful for controlling tracing from the command line. This is the recommended approach for integrating with external tools.
# Data will be flushed to /data/data/com.example/files/data.txt
adb shell setprop dev.nanoscope com.example:data.txt
adb shell setprop dev.nanoscope \'\'
-
thread.cc
Thread::SetThreadName(const char*)
The Java API is useful for controlling tracing programmatically and/or if you want to trace a Thread that is not the Main Thread. If you need reproducible trace start and end points, use this method instead of the System Property API. Note that with this method, you do not need to run nanoscope start/stop
.
Thread.currentThread().setName("START_TRACING");
// Data will be flushed to /data/data/com.example/files/data.txt
Thread.currentThread().setName("STOP_TRACING:/data/data/com.example/files/data.txt");
When using the Java API, you do not need to run nanoscope start/stop
. Once the data is flushed, pull the trace fill off the device and open it with nanoscope open <tracefile>
.
It's true that it seems more natural to expose a new API called Thread.startTracing()
, for example. The problem with this approach is that this new method wouldn't be visible to any apps as they'd be compiled against the standard Android SDK. Hijacking the Thread.setName
method allows us to provide an API that's accessible from all apps. And don't worry, the calls above won't actually change the Thread name.
The data is first flushed to a temporary file. Once the the file is fully written, it's moved over to the final location. Watch the final file path - when it exists, the flush is complete.
After stopping a trace, the trace data is processed and flushed to the specified location on disk. This section defines the trace file format.
The file format is a simple human-readable push/pop format. Each line conforms to this pattern:
[timestamp]:[methodSignature | "POP"]
If the name after the colon is POP
, then it represents a pop from the callstack. Otherwise, it's a push. Here's an example of a trace output file - it represents a parent method that calls two sibling methods:
12009599181900:void com.example.Test.methodA(java.lang.String)
12009599181946:void com.example.Test.methodB(java.lang.String)
12009599181982:POP
12009599181996:void com.example.Test.methodC()
12009599182019:POP
12009599182019:POP
The Test
class may have looked something like this:
class Test {
void methodA(String str) {
methodB(str);
methodC();
}
void methodB(String str) {}
void methodC() {}
}
Why not Google's Trace Event Format?
Nanoscope can easily generate millions of trace events. We found that Google's Trace Event Format was too verbose at this scale and slowed down the generation and parsing of the trace file.
-
thread.h
Thread.tlsPtr_.trace_data
Our guiding principle here is to do as little work as possible during method execution. Since memory access is likely to be a instrumentation performance bottleneck, we want to reduce the amount of data written at the start and end of every method. With that in mind, we have a minimal in-memory representation of the trace data that conceptually looks like this:
struct Event {
ArtMethod* method;
uint64_t timestamp;
};
Event[] trace_data;
The above Event
struct doesn't exist in code, but it does accurately represent how we store the trace data in memory. In practice, we store the trace data as an int64_t
array (Thread.tlsPtr_.trace_data
) for simplicity and to allow for easy extensions to the base format.
We've described the in-memory representation of our trace data above. This section describes how we write trace data to the buffer at the beginning and end of every Java method.
if (tlsPtr_.trace_data_ptr != nullptr) { // If our buffer exists then trace this Thread.
*tlsPtr_.trace_data_ptr++ = reinterpret_cast<int64_t>(method); // Write ArtMethod pointer to the buffer.
*tlsPtr_.trace_data_ptr++ = generic_timer_count(); // Write timer count to the buffer.
}
if (tlsPtr_.trace_data_ptr != nullptr) {
*tlsPtr_.trace_data_ptr++ = 0; // Only difference from Method Entry is we write a nullptr to indicate a pop from the callstack.
*tlsPtr_.trace_data_ptr++ = generic_timer_count();
}
Since Java code can be executed either as interpreted Dalvik bytecode or as compiled native instructions we cover the specifics of each scenario in the subsections below.
-
interpreter.cc
interpreter::Execute(Thread*, DexFile::CodeItem*, ShadowFrame&, JValue, bool)
-
thread.cc
Thread::TraceStart(ArtMethod*)
Thread::TraceEnd()
The interpreter logic is written exactly as shown in General Strategy. You can find the actual code for method entry and method exit in Thread::TraceStart
and Thread::TraceEnd
, respectively. All interpreted method execution flows through interpreter::Execute
so it's straightforward to instrument all of these calls by executing Thread::TraceStart
and Thread::TraceEnd
at appropriate times in this method.
-
code_generator_arm[64].cc
InstructionCodeGeneratorARM[64]::VisitTraceStart(HTraceStart*)
InstructionCodeGeneratorARM[64]::VisitTraceEnd(HTraceEnd*)
-
instruction_builder.cc
HInstructionBuilder::Build()
HInstructionBuilder::BuildReturn(Instruction&, Primitive::Type, uint32_t)
-
nodes.h
HGraph
HTraceStart
HTraceEnd
To improve runtime performance, Java code is sometimes compiled to native instructions (ART documentation). In these cases, calls don't flow through any single method like they do in the interpreter. The lack of a single instrumentable method complicates the instrumentation as we're forced to inject the appropriate logic at the beginning and end of every compiled method. We've added the following logic to accomplish this:
- Define two new pseudo-instructions (
HTraceStart
/HTraceEnd
) to be used in the intermediate representation (IR). - When we build the IR graph (
HGraph
), we add code to insert theHTraceStart
andHTraceEnd
at appropriate locations. This is done inHInstructionBuilder::Build
andHInstructionBuilder::BuildReturn
. - Now that we've marked where the trace logic shall be inserted, we can override the methods
InstructionCodeGeneratorARM[64]::VisitTraceStart
andInstructionCodeGeneratorARM[64]::VisitTraceEnd
where we'll actually generate the assembly for the method entry and method exit logic.
Below is the ARMv8-A assembly we need to generate at the beginning of each method. This logic lives in InstructionCodeGeneratorARM64::VisitTraceStart
. Similar logic is required at the end of each method:
; TR = thread register
; MR = method register
; data_ptr = temp register
; timestamp = temp register
VisitTraceStart:
LDR data_ptr, [TR, data_ptr_offset] ; data_ptr = thread.data_ptr
CBZ data_ptr, done ; if (data_ptr != null)
MRS timestamp, SYS_CNTVCT_EL0 ; timestamp = <Generic Timer count>
STR MR, [data_ptr], +8 ; *data_ptr++ = method
STR timestamp, [data_ptr], +8 ; *data_ptr++ = timestamp
STR data_ptr, [TR, data_ptr_offset] ; *thread.data_ptr = data_ptr
done:
-
thread.cc
flush_trace_data
Upon stopping tracing, we kick off a background thread to process and flush the trace data buffer. You can find this logic in the flush_trace_data
method. Here's a pseudocode representation of the flushing process:
// TIMER_FREQUENCY: Timer ticks per nanosecond
// ptr: Pointer to beginning of trace data buffer
// end: Pointer representing the end of data written to trace data buffer
while (ptr < end) {
methodName = convertToName(ptr++);
timestamp = convertToNanoseconds(ptr++);
writeToFile(methodName, timestamp);
}
convertToName(ptr) {
return ptr == 0 ? "POP" : resolveArtMethodName(ptr);
}
convertToNanoseconds(ptr) {
return ptr / TIMER_FREQUENCY;
}
-
thread.cc
generic_timer_count()
Reporting the current time is a potential bottleneck for instrumentation performance. To reduce the overhead of calculating the current timestamp and to simplify the generated machine code, we read directly from ARM's Generic Timer Register.
generic_timer_count()
In the interpreter, we use the following code to read the timestamp for 64-bit processes:
asm volatile("mrs %0, cntvct_el0" : "=r"(t));
The volatile
here is necessary to prevent the compiler from optimizing away multiple calls to generic_timer_count
which would result in stale timestamps.
-
code_generator_arm64.cc
SYS_CNTVCT_EL0
-
assembler_thumb2.cc
Thumb2Assembler::mrrc(Register, Register, int, int, int)
Ultimately, we generate the same assembly used in the interpreter to retrieve the timer value. The 64-bit and 32-bit code generators, however, each have limitations that make the implementation non-trivial.
The proper instruction to read from a system register is MRS as we saw in the interpreter implementation. Unfortunately, the code generation library that CodeGeneratorARM64
leverages (VIXL) doesn't include a definition for the cntvct_el0
Generic Timer count register. To work around this, we define this register ourselves based on the cntvct_el0
register documentation:
uint16_t SYS_CNTVCT_EL0 = ((1 << SysO0_offset) |
(3 << SysOp1_offset) |
(14 << CRn_offset) |
(0 << CRm_offset) |
(2 << SysOp2_offset)) >> ImmSystemRegister_offset;
Once we have this register properly defined, generating the proper instruction is simple:
__ Mrs(timestamp, (SystemRegister) SYS_CNTVCT_EL0);
Similarly to the 64-bit code generator, the 32-bit version, CodeGeneratorARM
, also lacks a definition of the cntvct_el0
system register. To make matters worse, its assembler doesn't even support the the instruction we need (mrrc
). So the first step is to implement the code generation logic for mrrc
in assembler_thumb2.cc
:
void Thumb2Assembler::mrrc(Register r1, Register r2, int opc1, int coproc, int crm) {
// ARM Architecture Reference Manual: Thumb-2 Supplement 4.6.81
uint32_t hw1 = (0b111011000101 << 4) | (static_cast<uint32_t>(r2) << 0);
uint32_t hw2 = (static_cast<uint32_t>(r1) << 12) | (coproc << 8) | (opc1 << 4) | (crm << 0);
// Note that Emit32 swaps hw1 and hw2 parts for us so we order them in the natural order here.
Emit32(hw1 << 16 | hw2);
}
The code above writes the raw bits required to generate the mrrc
instruction with the given parameters. With the new instruction defined, we can generate the register read:
__ mrrc(temp1, temp2, 0b0001, 0b1111, 0b1110);