This post aims to help people who want to better understand performance bottlenecks in their applications. There are many existing methodologies for performance analysis, but not many of them are robust and formal. When I was just starting with performance work, I usually just profiled the app and dug through the hotspots of the benchmark hoping to find something there. This often led to random experiments with unrolling, vectorization, inlining, you name it. I’m not saying it’s always a losing strategy. Sometimes you can be lucky and get a big performance boost from random experiments. But usually you need very good intuition and luck :).
In this post I show a more formal way to do performance analysis. It’s called Top-down Microarchitecture Analysis Method (TMAM) and is described in the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Appendix B.1. In this methodology we try to detect what is stalling our execution, starting from the high-level components (like Front End, Back End, Retiring, Branch Predictor) and narrowing down to the source of performance inefficiencies.
It’s an iterative process with 2 steps:
- Identify the type of the performance problem.
- Locate the exact place in the code using PEBS (precise event).
After fixing a performance issue you repeat the process again.
If it doesn’t make sense to you yet, don’t worry, it’ll become clear with the example.
TMAM conceptually works in a “black box” manner, assuming we know nothing about the benchmark. Let’s imagine we have a binary (a.out) and it runs for 8.5 sec:

```
$ time -p ./a.out
real 8.53
```
These are the two most important pictures for TMAM (taken from the Intel manual, see the link above). The first shows the breakdown of metric levels in TMAM, and the second shows the reasoning for a single uop:
We will run our app and collect specific metrics that will help us characterize it, trying to detect which category our application falls into.
Let’s run it, collecting level-1 metrics on our binary:
```
$ ~/pmu-tools/toplev.py --core S0-C0 -l1 -v --no-desc taskset -c 0 ./a.out
...
S0-C0    FE     Frontend_Bound:    13.81 +- 0.00 % Slots below
S0-C0    BAD    Bad_Speculation:    0.22 +- 0.00 % Slots below
S0-C0    BE     Backend_Bound:     53.43 +- 0.00 % Slots <==
S0-C0    RET    Retiring:          32.53 +- 0.00 % Slots below
S0-C0-T0        MUX:              100.00 +- 0.00 %
S0-C0-T1        MUX:              100.00 +- 0.00 %
```
All right, now we know that we are bound by the backend. Let’s drill one level down:
```
$ ~/pmu-tools/toplev.py --core S0-C0 -l2 -v --no-desc taskset -c 0 ./a.out
...
S0-C0    FE      Frontend_Bound:                       13.92 +- 0.00 % Slots below
S0-C0    BAD     Bad_Speculation:                       0.23 +- 0.00 % Slots below
S0-C0    BE      Backend_Bound:                        53.39 +- 0.00 % Slots
S0-C0    RET     Retiring:                             32.49 +- 0.00 % Slots
S0-C0    FE      Frontend_Bound.Frontend_Latency:      12.11 +- 0.00 % Slots below
S0-C0    FE      Frontend_Bound.Frontend_Bandwidth:     1.84 +- 0.00 % Slots below
S0-C0    BAD     Bad_Speculation.Branch_Mispredicts:    0.22 +- 0.00 % Slots below
S0-C0    BAD     Bad_Speculation.Machine_Clears:        0.01 +- 0.00 % Slots below
S0-C0    BE/Mem  Backend_Bound.Memory_Bound:           44.59 +- 0.00 % Slots <==
S0-C0    BE/Core Backend_Bound.Core_Bound:              8.80 +- 0.00 % Slots below
S0-C0    RET     Retiring.Base:                        24.83 +- 0.00 % Slots below
S0-C0    RET     Retiring.Microcode_Sequencer:          7.65 +- 0.00 % Slots
```
Okay, we see that we are actually memory bound: almost half of the execution time the CPU was stalled waiting for memory requests to arrive. Let’s go one level deeper:
```
$ ~/pmu-tools/toplev.py --core S0-C0 -l3 -v --no-desc taskset -c 0 ./a.out
...
S0-C0    FE      Frontend_Bound:                                  13.91 +- 0.00 % Slots below
S0-C0    BAD     Bad_Speculation:                                  0.24 +- 0.00 % Slots below
S0-C0    BE      Backend_Bound:                                   53.36 +- 0.00 % Slots
S0-C0    RET     Retiring:                                        32.41 +- 0.00 % Slots
S0-C0    FE      Frontend_Bound.Frontend_Latency:                 12.10 +- 0.00 % Slots below
S0-C0    FE      Frontend_Bound.Frontend_Bandwidth:                1.85 +- 0.00 % Slots below
S0-C0    BAD     Bad_Speculation.Branch_Mispredicts:               0.23 +- 0.00 % Slots below
S0-C0    BAD     Bad_Speculation.Machine_Clears:                   0.01 +- 0.00 % Slots below
S0-C0    BE/Mem  Backend_Bound.Memory_Bound:                      44.58 +- 0.00 % Slots
S0-C0    BE/Core Backend_Bound.Core_Bound:                         8.78 +- 0.00 % Slots below
S0-C0    RET     Retiring.Base:                                   24.77 +- 0.00 % Slots below
S0-C0    RET     Retiring.Microcode_Sequencer:                     7.63 +- 0.00 % Slots
S0-C0    FE      Frontend_Bound.Frontend_Bandwidth.MITE:           7.33 +- 0.00 % CoreClocks below
S0-C0    FE      Frontend_Bound.Frontend_Bandwidth.DSB:            2.19 +- 0.00 % CoreClocks below
S0-C0    FE      Frontend_Bound.Frontend_Bandwidth.LSD:            0.00 +- 0.00 % CoreClocks below
S0-C0-T0 FE      Frontend_Bound.Frontend_Latency.ICache_Misses:    0.05 +- 0.00 % Clocks below
S0-C0-T0 FE      Frontend_Bound.Frontend_Latency.ITLB_Misses:      0.00 +- 0.00 % Clocks below
S0-C0-T0 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers:  0.10 +- 0.00 % Clocks_Estimated below
S0-C0-T0 FE      Frontend_Bound.Frontend_Latency.DSB_Switches:     0.00 +- 0.00 % Clocks below
S0-C0-T0 FE      Frontend_Bound.Frontend_Latency.LCP:              0.00 +- 0.00 % Clocks below
S0-C0-T0 FE      Frontend_Bound.Frontend_Latency.MS_Switches:      3.86 +- 0.00 % Clocks
S0-C0-T0 BE/Mem  Backend_Bound.Memory_Bound.L1_Bound:              4.39 +- 0.00 % Stalls below
S0-C0-T0 BE/Mem  Backend_Bound.Memory_Bound.L2_Bound:              2.42 +- 0.00 % Stalls below
S0-C0-T0 BE/Mem  Backend_Bound.Memory_Bound.L3_Bound:              5.75 +- 0.00 % Stalls
S0-C0-T0 BE/Mem  Backend_Bound.Memory_Bound.DRAM_Bound:           47.11 +- 0.00 % Stalls <==
S0-C0-T0 BE/Mem  Backend_Bound.Memory_Bound.Store_Bound:           0.69 +- 0.00 % Stalls below
S0-C0-T0 BE/Core Backend_Bound.Core_Bound.Divider:                 8.56 +- 0.00 % Clocks below
S0-C0-T0 BE/Core Backend_Bound.Core_Bound.Ports_Utilization:      11.31 +- 0.00 % Clocks below
S0-C0-T0 RET     Retiring.Base.FP_Arith:                           1.45 +- 0.00 % Uops below
S0-C0-T0 RET     Retiring.Base.Other:                             98.55 +- 0.00 % Uops below
S0-C0-T0 RET     Retiring.Microcode_Sequencer.Assists:             0.00 +- 0.00 % Slots_Estimated below
S0-C0-T0         MUX:                                              3.45 +- 0.00 %
```
Cool! We found the bottleneck. Step #1 is complete. Let’s now move to step #2.
To locate the place in the code where this is happening, we need to refer to the TMA metrics table. I know, I know, it looks scary and big. Don’t worry, it’s complex only the first time you see it.
For the Skylake architecture, the DRAM_Bound metric is calculated using the CYCLE_ACTIVITY.STALLS_L3_MISS performance event. Let’s collect it:

```
$ perf stat -e cycles,cpu/event=0xa3,umask=0x6,cmask=0x6,name=CYCLE_ACTIVITY.STALLS_L3_MISS/ ./a.out
 32226253316  cycles
 19764641315  CYCLE_ACTIVITY.STALLS_L3_MISS
```
According to its definition, CYCLE_ACTIVITY.STALLS_L3_MISS counts cycles when execution stalls while an L3 cache miss demand load is outstanding. We can see that ~60% of all cycles were such stalls, which is pretty bad.
The Locate-with column of that table shows the performance event that we can use to locate the exact place in the code where the issue occurs. For the DRAM_Bound metric we should use the MEM_LOAD_RETIRED.L3_MISS_PS precise event. Let’s sample on it:

```
$ perf record -e cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp ./a.out
```
If you don’t understand the underlying mechanics of what we just did, I encourage you to read my previous posts: Basics of profiling with perf and Understanding performance events skid. Let’s look into the profile:
```
$ perf report -n --stdio
...
# Samples: 33K of event 'MEM_LOAD_RETIRED.L3_MISS'
# Event count (approx.): 71363893
#
# Overhead  Samples  Command  Shared Object      Symbol
# ........  .......  .......  .................  ..........................
#
  99.95%    33811    a.out    a.out              [.] foo  <==
   0.03%       52    a.out    [kernel.kallsyms]  [k] get_page_from_freelist
   0.01%        3    a.out    [kernel.kallsyms]  [k] free_pages_prepare
   0.00%        1    a.out    [kernel.kallsyms]  [k] free_pcppages_bulk
```
All L3 misses are caused by our code. Let’s drill down to assembly:
```
$ perf annotate --stdio -M intel foo
 Percent |  Source code & Disassembly of a.out for MEM_LOAD_RETIRED.L3_MISS
-------------------------------------------------------------------------------
         :  Disassembly of section .text:
         :
         :  0000000000400a00 <foo>:
         :  foo():
    0.00 :    400a00: nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :    400a08: nop    DWORD PTR [rax+rax*1+0x0]
  ...
    0.00 :    400df0: nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :    400df8: nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :    400e00: mov    rax,QWORD PTR [rdi]
    0.00 :    400e03: mov    rax,QWORD PTR [rdi+0xa]
  100.00 :    400e07: mov    rax,QWORD PTR [rdi+rsi*1]  <==
    0.00 :    400e0b: mov    rax,QWORD PTR [rdi+rax*1]
    0.00 :    400e0f: mov    rax,QWORD PTR [rdi+0x14]
    0.00 :    400e13: xor    rax,rax
    0.00 :    400e16: ret
```
Just out of curiosity I collected the number of L3 misses:
```
$ perf stat -e cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ ./a.out
 71370594  MEM_LOAD_RETIRED.L3_MISS
```
It shows that 7 out of every 10 iterations had loads that missed in L3 (~71M misses for 100M iterations).
Now that we know which instruction caused so many L3 misses, let’s fix it.
Fixing the issue
Let’s look at the code:
```cpp
extern "C" void foo(char* a, int n);
const int _200MB = 1024*1024*200;

int main() {
  char* a = (char*)malloc(_200MB); // 200 MB buffer
  ...
  for (int i = 0; i < 100000000; i++) {
    int random_int = distribution(generator);
    foo(a, random_int);
  }
  ...
}
```
I allocate an array big enough that it does not fit in the L3 cache (the L3 cache on the machine I was using, an Intel(R) Xeon(R) Platinum 8180 CPU, is 38.5 MB). Inside the foo function (written in assembly in order to avoid compiler optimizations) I’m reading a random memory location:
```asm
foo:
  One_KB_of_nops              # emulate some irrelevant work
  mov rax, QWORD [rdi + 0]    # constant load
  mov rax, QWORD [rdi + 10]   # constant load
  mov rax, QWORD [rdi + rsi]  # load that goes to DRAM
  mov rax, QWORD [rdi + rax]  # introduce dependency chain
  mov rax, QWORD [rdi + 20]   # constant load
```
You probably already guessed what we should do. Yes, it’s prefetching. Because there is significant time between the moment we compute the next address we will read and the actual load instruction, we can add a prefetch hint (more details about __builtin_prefetch can be found in the GCC documentation):
```cpp
  for (int i = 0; i < 100000000; i++) {
    int random_int = distribution(generator);
+   __builtin_prefetch(a + random_int, 0, 1);
    foo(a, random_int);
  }
```
Code samples can be found on my github.
This hint improved the execution time by 2 seconds, a ~30% speedup:
```
 24621931288  cycles
  2069238765  CYCLE_ACTIVITY.STALLS_L3_MISS
     8889566  MEM_LOAD_RETIRED.L3_MISS
 6,498080824 seconds time elapsed
```
Notice the ~10x smaller value for MEM_LOAD_RETIRED.L3_MISS. However, the misses didn’t fully go away. There is a technique based on LBR that can help us estimate our prefetch window. If there is interest, I can write an additional post about it; just leave a comment if you are interested.
Remember that TMAM is an iterative process, so we now need to repeat from step #1. Likely this fix moved the bottleneck into some other bucket, probably Retiring. Ideally we want to be 100% bound by retirement. Most of the time that is a good thing, but not always: there are situations where retirement is very high, yet the app still performs slowly. This usually happens when the Microcode Sequencer starts feeding uops to the pipeline, as shown here.