Go compiler performance tooling: a step-by-step guide
Recently I started looking at Go compiler performance as part of my work. I work with a large monorepo. Some code
is generated, and at that scale, lots of weird edge cases pop up. The underlying reasons do not matter, but it
got me curious about what cmd/compile spends its time on and whether any of it is
avoidable.
The goal: make the compiler faster without changing the code it generates. The output must be byte-identical. If it is not, the change has altered the language, not the speed.
This is different from application performance work. There, the focus is the code that runs. Here, it is the tool that builds it. Profiling, measuring, and proving safety all work differently.
Two questions every change must answer
Every compiler performance change must prove two things:
- Safe: the generated code is byte-identical. Verified with toolstash -cmp over the entire standard library.
- Better: a statistically significant improvement. Verified with compilebench and benchstat over at least 20 runs.
Decision flow: does the changed compiler produce the same binaries as before? If the output differs, discard the change (output changed). If it is identical, is the result statistically better? If confirmed, keep it; if inconclusive, discard it (no improvement).
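The same gate can be written down as a few lines of shell. This is only a sketch of the decision flow: `verdict` is a hypothetical helper, and it assumes the `toolstash -cmp` output was captured to a file and that the delta and p-value were read off the benchstat report by hand.

```shell
# verdict CMP_LOG DELTA_PCT P_VALUE (hypothetical helper, not part of any Go tool)
#   CMP_LOG:   file holding whatever `toolstash -cmp` printed (empty = identical)
#   DELTA_PCT: change reported by benchstat, negative = improvement
#   P_VALUE:   p-value reported by benchstat
verdict() {
  # Safe? Any captured output means the generated code changed.
  if [ -s "$1" ]; then
    echo "discard: output changed"
    return 1
  fi
  # Better? Require a negative delta and a p-value below 0.05.
  if awk -v d="$2" -v p="$3" 'BEGIN { exit !(d < 0 && p < 0.05) }'; then
    echo "keep"
  else
    echo "discard: no improvement"
    return 1
  fi
}
```

The order matters: the safety check runs first, so a change that alters the output is rejected no matter how good its numbers look.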
Which metrics to trust
Running compilebench with the -alloc flag and comparing the results with benchstat produces five metric blocks:
| Metric | What it means | Deterministic? | Trust level |
|---|---|---|---|
| allocs/op | Number of heap allocations per operation¹ | Yes | Primary evidence. Same input, same count. |
| B/op | Bytes of heap memory allocated per operation | Yes | Primary evidence. Deterministic. |
| sec/op | Wall-clock seconds per operation | No | Only with perflock on a quiet, dedicated box. |
| user-sec/op | User-space CPU seconds per operation | No | Directional only. Varies between runs. |
| maxRSS/op | Peak resident memory (RSS) per operation | No | Often an artifact. Small packages hit a floor. |
allocs/op and B/op come from Go’s runtime memory counters. They are a property of the
compiler and its input, not the machine. Same compiler, same source file, same count every time.
sec/op is wall-clock time: the real elapsed seconds from start to finish. It depends on what else is running
on the machine, what the CPU frequency is, and whether anything is competing for CPU cache. A
0.5% allocation reduction will often show zero measurable change in sec/op, even on a quiet box. That does not
mean the change is pointless. Wall time is a blunt instrument for small, fixed-percentage improvements.
Allocation counts are the sharper one.
user-sec/op measures CPU time spent in user space (the program itself, excluding time the kernel spends on
its behalf for I/O and system calls). It is less affected by background load than sec/op but still varies
between runs because of thread scheduling and cache effects. Useful for spotting large regressions, not for
proving a small win.
maxRSS/op reports the highest amount of physical memory the compiler used during one compilation. For small
packages, the OS always reports the same minimum regardless of the change, so the number stops being useful.
Small positive deltas are usually noise, not real regressions.
The practical upshot: for allocation-focused work, publishable, reproducible numbers can be obtained on a laptop. No server room needed.
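A quick way to check the determinism claim against your own results is to look at the spread of allocs/op across runs. The sketch below assumes a results file in the standard Go benchmark text format (lines starting with `Benchmark`, followed by value/unit pairs); `alloc_spread` is a hypothetical helper name.

```shell
# alloc_spread FILE: print "name min max" of allocs/op for each benchmark.
# Assumes Go benchmark text format: "BenchmarkX 1 <value> <unit> <value> <unit> ...".
alloc_spread() {
  awk '/^Benchmark/ {
    for (i = 2; i < NF; i++) {
      if ($(i + 1) == "allocs/op") {
        v = $i + 0
        if (!($1 in min) || v < min[$1]) min[$1] = v
        if (!($1 in max) || v > max[$1]) max[$1] = v
      }
    }
  }
  END { for (b in min) printf "%s %d %d\n", b, min[b], max[b] }' "$1" | sort
}
```

If min and max match for a benchmark, every run produced the same allocation count, which is what a ± 0% column in benchstat reflects.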
The workflow
The compiler README lists the official set of helpful tools, including bent for large-scale benchmarks and view-annotated-file for overlaying compiler decisions onto source code. This section walks through my typical workflow and why each tool exists.
1. git worktree
The comparison needs two builds of the compiler side by side: one without the change (baseline) and one with
(experiment). Without worktrees, switching branches means rebuilding the entire toolchain
(compiler, linker, and other build tools) every time. Two git worktree directories let both builds exist at
once.
The baseline worktree stays clean so its compiler binary can be used as the reference in benchmarks.
# create two worktrees from the same commit
git worktree add --detach "$BEFORE" HEAD
git worktree add --detach "$AFTER" HEAD
2. make.bash
Go builds itself from source. There is no way to patch just the compiler in an existing
installation. make.bash (in the src/ directory) produces a runnable bin/go and a compile binary from
the source tree. It needs an existing Go installation to bootstrap from, set via
GOROOT_BOOTSTRAP.
Set GOTOOLCHAIN=local to prevent the built toolchain from auto-downloading a different version.
Without it, the carefully built compiler gets silently replaced.
# build the toolchain from source (the "$AFTER" worktree needs the same build once the change is applied)
cd "$BEFORE/src" && GOROOT_BOOTSTRAP=$(go env GOROOT) GOTOOLCHAIN=local ./make.bash
3. perflock
The CPU adjusts its clock speed based on load and thermal conditions, which makes sec/op vary between runs.
perflock pins the frequency. A quiet, dedicated box removes remaining cache and scheduler noise.
allocs/op and B/op are deterministic regardless, so perflock is optional for allocation work.
# start the perflock daemon (Linux only)
sudo $(which perflock) -daemon
4. compile -bench
The compile binary accepts a -bench flag that prints wall time for each phase (parsing, inlining, escape
analysis, SSA backend, object writing). A CPU profile sums time across all threads, so the parallel backend
looks dominant even when it is not where the wall time goes. -bench shows the per-phase wall-time breakdown instead.
# print per-phase wall time for one package
compile -bench /dev/stdout -I "$std" -o /dev/null -p "$pkg" "${files[@]}"
compile is the binary built by make.bash, $std is a directory of prebuilt standard library imports,
$pkg is the package to profile, and ${files[@]} is its source files.
5. perf stat
On Linux, perf stat wraps the compiler and reports hardware-level metrics: instructions per cycle
(IPC), page faults, branch misses, and the split between user and system time.
IPC measures how much work the CPU does per clock cycle. A value below 1.0 means the CPU is mostly waiting, usually for memory.
The Go compiler sets a 128 MB heap goal at startup so that GC does not trigger until the heap reaches that size. As a result, the OS maps many memory pages, which shows up as high page-fault counts and high system time (time the OS spends handling those page faults, as opposed to user time, where the compiler itself runs).
# hardware counters: IPC, page faults, user/sys split
perf stat compile -bench /dev/stdout -I "$std" -o /dev/null -p "$pkg" "${files[@]}"
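IPC is just the ratio of two counters that perf stat reports: instructions divided by cycles. A minimal sketch of the arithmetic, where `ipc` is a hypothetical helper and the counts are made-up numbers, not measurements:

```shell
# ipc INSTRUCTIONS CYCLES: instructions retired per clock cycle.
ipc() {
  awk -v i="$1" -v c="$2" 'BEGIN { printf "%.2f\n", i / c }'
}

# Hypothetical counts: 8 billion instructions over 10 billion cycles.
ipc 8000000000 10000000000   # prints 0.80
```

A result like 0.80 would fall below the 1.0 line mentioned above, pointing at memory stalls rather than a shortage of compute.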
6. go build -gcflags=-m
Passing -gcflags=-m to go build tells the compiler to print its escape analysis² decisions:
which variables stay on the stack and which escape to the heap. Once an allocation profile points to a function
allocating more than expected, this flag shows why.
# show which variables escape to the heap
go build -gcflags='-m' cmd/compile 2>&1 | grep 'escapes to heap'
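On a large package the raw -m output is long, so I find it useful to rank files by how many heap escapes they report. The sketch below is a hypothetical filter, `escapes_per_file`, that reads -m diagnostics on stdin. It assumes each diagnostic starts with `file:line:col:` and counts both the "escapes to heap" and "moved to heap" messages.

```shell
# escapes_per_file: read `-gcflags=-m` diagnostics on stdin and print
# "count file", most escapes first. Assumes lines start with "file:line:col:".
escapes_per_file() {
  grep -E 'escapes to heap|moved to heap' |
    awk -F: '{ n[$1]++ } END { for (f in n) printf "%d %s\n", n[f], f }' |
    sort -k1,1nr -k2
}

# usage:
#   go build -gcflags='-m' cmd/compile 2>&1 | escapes_per_file | head
```

The files at the top of the list are a reasonable place to start reading, though a high count alone does not mean the allocations are hot.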
7. toolstash
After applying a change and rebuilding, the only compiler available is the changed one. toolstash save
snapshots the clean compiler binaries before the change. After rebuilding, toolstash -cmp runs both the saved
and live compilers on the same input and diffs the output byte for byte. Zero output means byte-identical.
Save before the change, not after. If saved after, the comparison proves nothing.
# snapshot the clean compiler
toolstash save
# rebuild with the change
go install cmd/compile
# zero output = byte-identical
go build -toolexec 'toolstash -cmp' -a std
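Because "zero output" is easy to misread in a scrolling terminal, the check can be wrapped so that it fails loudly. This is a sketch with a hypothetical helper, `assert_silent`, which treats any output or a non-zero exit status as a failure; it knows nothing about toolstash itself.

```shell
# assert_silent CMD...: succeed only if CMD exits 0 and prints nothing.
assert_silent() {
  # Capture stdout and stderr together; remember the exit status.
  _out=$("$@" 2>&1) && _status=0 || _status=$?
  if [ "$_status" -ne 0 ] || [ -n "$_out" ]; then
    printf '%s\n' "$_out"
    echo "FAIL: command printed output or exited non-zero"
    return 1
  fi
  echo "OK: no output"
}

# usage (assumes `toolstash save` was run on the clean tree first):
#   assert_silent go build -toolexec 'toolstash -cmp' -a std
```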
8. compilebench
go test -bench benchmarks application code, not the compiler. compilebench measures the compiler itself,
compiling standard library packages and reporting the metrics from the table above. Use -compile to point at
each worktree’s compiler binary. Run baseline and changed passes sequentially, never in parallel.
# baseline: compile with the clean compiler
compilebench -alloc -count 20 -run "$RUN" -compile "$BEFORE/pkg/tool/$A/compile" > /tmp/old.txt
# experiment: compile with the changed compiler
compilebench -alloc -count 20 -run "$RUN" -compile "$AFTER/pkg/tool/$A/compile" > /tmp/new.txt
$BEFORE and $AFTER are the worktree paths from step 1. $A is the platform directory
(e.g. linux_amd64). $RUN is the benchmark filter
(e.g. Template|Unicode|GoTypes|Compiler|SSA|Flate|GoParser|Reflect|Tar|XML).
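Before comparing, it is worth confirming that each results file really holds the expected number of samples per benchmark, since an interrupted run leaves a short file behind. A small sketch, assuming the results use the standard Go benchmark text format with one `Benchmark...` line per run; `count_samples` is a hypothetical helper.

```shell
# count_samples FILE: print "name count" for every benchmark in FILE.
count_samples() {
  awk '/^Benchmark/ { n[$1]++ } END { for (b in n) printf "%s %d\n", b, n[b] }' "$1" | sort
}

# usage:
#   count_samples /tmp/old.txt
#   count_samples /tmp/new.txt
```

With -count 20, every benchmark should show 20 in both files.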
9. benchstat
benchstat compares two sets of benchmark results and reports a p-value for each metric. A p-value below 0.05
means the difference is unlikely to be noise. The more samples (set via -count in compilebench), the more
reliable the comparison.
# statistical comparison of the two runs
benchstat /tmp/old.txt /tmp/new.txt
Example output:
│ old.txt │ new.txt │
│ allocs/op │ allocs/op vs base │
GoTypes 5.013M ± 0% 4.971M ± 0% -0.83% (p=0.000)
SSA 47.02M ± 0% 46.50M ± 0% -1.12% (p=0.000)
Flate 823.9k ± 0% 813.8k ± 0% -1.23% (p=0.000)
geomean -0.54%
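The "vs base" column is the relative change from old to new. A minimal sketch of that arithmetic, with a hypothetical helper `pct_delta` fed the rounded values shown in the Flate row (823.9k and 813.8k):

```shell
# pct_delta OLD NEW: relative change from OLD to NEW, in percent.
pct_delta() {
  awk -v o="$1" -v n="$2" 'BEGIN { printf "%+.2f%%\n", (n - o) / o * 100 }'
}

pct_delta 823900 813800   # Flate row above: prints -1.23%
```

benchstat works from the unrounded samples, so recomputing other rows from the displayed values can differ in the last digit.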
What is next
That is the tooling. In the next post, I will put it all to use on a real change and walk through the results.
¹ “operation” means one compilation of a single package by compilebench. Each metric is measured per compilation run.
² Escape analysis determines whether a variable can stay on the stack or must be heap-allocated. The decision is per variable, not per code path. See FAQ: heap vs stack.