Go compiler performance tooling: a step-by-step guide

Recently I started looking at Go compiler performance as part of my work. I work with a large monorepo. Some code is generated, and at that scale, lots of weird edge cases pop up. The underlying reasons do not matter, but it got me curious about what cmd/compile spends its time on and whether any of it is avoidable.

The goal: make the compiler faster without changing the code it generates. The output must be byte-identical. If it is not, the change has altered what the compiler produces, not just how fast it produces it.

This is different from application performance work. There, the focus is the code that runs. Here, it is the tool that builds it. Profiling, measuring, and proving safety all work differently.

Two questions every change must answer

Every compiler performance change must prove two things:

Safe: the generated code is byte-identical. Verified with toolstash -cmp over the entire standard library.
Better: a statistically significant improvement. Verified with compilebench and benchstat over at least 20 runs.

graph TD
    A[Change the compiler code] --> B{Exactly the same binaries as before?}
    B -- identical --> C{Statistically better?}
    B -- differs --> D[Discard: output changed]
    C -- confirmed --> E[Keep]
    C -- inconclusive --> F[Discard: no improvement]
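The two gates can also be written down as a small shell function. This is only a sketch: `verdict` is a hypothetical helper of mine, not part of toolstash or benchstat, and it assumes the output of the comparison build was captured in a file and that the p-value and delta were read off the benchstat table by hand.

```shell
# verdict CMP_LOG P_VALUE DELTA_PERCENT
# Hypothetical helper, not part of toolstash or benchstat.
#   CMP_LOG        file holding whatever the toolstash -cmp build printed
#   P_VALUE        p-value benchstat reported for the metric of interest
#   DELTA_PERCENT  signed change for that metric, e.g. -0.83 (negative = less)
verdict() {
  local cmp_log=$1 p=$2 delta=$3

  # Refuse to judge without a comparison log: a missing file is not "identical".
  if [ ! -e "$cmp_log" ]; then
    echo "error: no comparison log at $cmp_log" >&2
    return 2
  fi

  # Gate 1 (safe): any output from the comparison build means the binaries differ.
  if [ -s "$cmp_log" ]; then
    echo "discard: output changed"
    return 0
  fi

  # Gate 2 (better): the change must be significant and point in the right direction.
  if awk -v p="$p" -v d="$delta" 'BEGIN { exit !(p + 0 < 0.05 && d + 0 < 0) }'; then
    echo "keep"
  else
    echo "discard: no improvement"
  fi
}
```

With the comparison output saved to, say, /tmp/cmp.log, calling `verdict /tmp/cmp.log 0.000 -0.83` prints the decision. The order matters: the safety gate runs first, so a change that alters the output is discarded no matter how good its numbers look.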

Which metrics to trust

compilebench run with the -alloc flag, summarized by benchstat, produces five metric blocks:

Metric	What it means	Deterministic?	Trust level
`allocs/op`	Number of heap allocations per operation¹	Yes	Primary evidence. Same input, same count.
`B/op`	Bytes of heap memory allocated per operation	Yes	Primary evidence. Deterministic.
`sec/op`	Wall-clock seconds per operation	No	Only with perflock on a quiet, dedicated box.
`user-sec/op`	User-space CPU seconds per operation	No	Directional only. Varies between runs.
`maxRSS/op`	Peak resident memory (RSS) per operation	No	Often an artifact. Small packages hit a floor.

allocs/op and B/op come from Go’s runtime memory counters. They are a property of the compiler and its input, not the machine. Same compiler, same source file, same count every time.

sec/op is wall-clock time: the real elapsed seconds from start to finish. It depends on what else is running on the machine, what the CPU frequency is, and whether anything is competing for CPU cache. A 0.5% allocation reduction will often show zero measurable change in sec/op, even on a quiet box. That does not mean the change is pointless. Wall time is a blunt instrument for small, fixed-percentage improvements. Allocation counts are the sharper one.

user-sec/op measures CPU time spent in user space (the program itself, excluding time the kernel spends on its behalf for I/O and system calls). It is less affected by background load than sec/op but still varies between runs because of thread scheduling and cache effects. Useful for spotting large regressions, not for proving a small win.

maxRSS/op reports the highest amount of physical memory the compiler used during one compilation. For small packages, the OS always reports the same minimum regardless of the change, so the number stops being useful. Small positive deltas are usually noise, not real regressions.

The practical upshot: for allocation-focused work, publishable, reproducible numbers can be obtained on a laptop. No server room needed.

The workflow

The compiler README lists the official set of helpful tools, including bent for large-scale benchmarks and view-annotated-file for overlaying compiler decisions onto source code. This section walks through my typical workflow and why each tool exists.

1. git worktree

The comparison needs two builds of the compiler side by side: one without the change (baseline) and one with (experiment). Without worktrees, switching branches means rebuilding the entire toolchain (compiler, linker, and other build tools) every time. Two git worktree directories let both builds exist at once.

The baseline worktree stays clean so its compiler binary can be used as the reference in benchmarks.

# create two worktrees from the same commit
git worktree add --detach "$BEFORE" HEAD
git worktree add --detach "$AFTER"  HEAD

2. make.bash

Go builds itself from source. There is no way to patch just the compiler in an existing installation. make.bash (in the src/ directory) produces a runnable bin/go and a compile binary from the source tree. It needs an existing Go installation to bootstrap from, set via GOROOT_BOOTSTRAP.

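Set GOTOOLCHAIN=local to prevent the built toolchain from auto-downloading a different version. Without it, a go or toolchain line in go.mod that names a newer version makes the go command switch toolchains, and the carefully built compiler is silently bypassed.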

# build the toolchain from source
cd "$BEFORE/src" && GOROOT_BOOTSTRAP=$(go env GOROOT) GOTOOLCHAIN=local ./make.bash

3. perflock

The CPU adjusts its clock speed based on load and thermal conditions, which makes sec/op vary between runs. perflock holds the frequency at a fixed level below the maximum and serializes the benchmark commands that are started through it. A quiet, dedicated box removes the remaining cache and scheduler noise. allocs/op and B/op are deterministic regardless, so perflock is optional for allocation work.

# start the perflock daemon (Linux only)
sudo $(which perflock) -daemon

4. compile -bench

The compile binary accepts a -bench flag that writes the wall time of each phase (parsing, inlining, escape analysis, SSA backend, object writing) to a file. A CPU profile sums samples across all threads, so the concurrent backend can look dominant even when it is not where the elapsed time goes. -bench shows the per-phase wall-time breakdown instead.

# print per-phase wall time for one package
compile -bench /dev/stdout -I "$std" -o /dev/null -p "$pkg" "${files[@]}"

compile is the binary built by make.bash, $std is a directory of prebuilt standard library imports, $pkg is the package to profile, and ${files[@]} is its source files.

5. perf stat

On Linux, perf stat wraps the compiler and reports hardware-level metrics: instructions per cycle (IPC), page faults, branch misses, and the split between user and system time.

IPC measures how much work the CPU does per clock cycle. A value below 1.0 means the CPU is mostly waiting, usually for memory. The Go compiler sets a 128 MB heap goal at startup so that GC does not trigger until the heap reaches that size. As a result the OS maps many fresh memory pages, which shows up as high page-fault counts and high system time (time the kernel spends handling those faults, as opposed to user time, where the compiler itself runs).

# hardware counters: IPC, page faults, user/sys split
perf stat compile -bench /dev/stdout -I "$std" -o /dev/null -p "$pkg" "${files[@]}"
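IPC is simply retired instructions divided by cycles, so it can be recomputed from the raw counters. The sketch below assumes perf stat was run with -x, (CSV output), -e instructions,cycles, and -o to write the counters to a file, and that the counter value is the first CSV field and the event name the third. `ipc_from_csv` is a hypothetical helper of mine. On some machines the event names carry prefixes or suffixes, so the patterns may need adjusting.

```shell
# ipc_from_csv FILE
# Hypothetical helper: reads perf stat CSV output (perf stat -x, ...) and
# prints instructions / cycles with two decimals.
ipc_from_csv() {
  awk -F, '
    $3 ~ /^instructions/ { ins = $1 }   # field 1 = counter value, field 3 = event name
    $3 ~ /^cycles/       { cyc = $1 }
    END {
      if (cyc + 0 == 0) { print "no cycle count found" > "/dev/stderr"; exit 1 }
      printf "%.2f\n", ins / cyc
    }
  ' "$1"
}
```

A run such as `perf stat -x, -e instructions,cycles -o /tmp/perf.csv compile ...` followed by `ipc_from_csv /tmp/perf.csv` then gives a single number to compare between the baseline and the changed compiler.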

6. go build -gcflags=-m

Passing -gcflags=-m to go build tells the compiler to print its escape analysis² decisions: which variables stay on the stack and which escape to the heap. Once an allocation profile points to a function allocating more than expected, this flag shows why.

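# show which variables escape to the heap; the cmd/compile/... pattern applies -m
# to the compiler's internal packages, not only to the small main package
go build -gcflags='cmd/compile/...=-m' cmd/compile 2>&1 | grep 'escapes to heap'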

7. toolstash

After applying a change and rebuilding, the only compiler available is the changed one. toolstash save snapshots the clean compiler binaries before the change. After rebuilding, toolstash -cmp runs both the saved and live compilers on the same input and diffs the output byte for byte. Zero output means byte-identical.

Save before the change, not after. If saved after, the comparison proves nothing.

# snapshot the clean compiler
toolstash save
# rebuild with the change
go install cmd/compile
# zero output = byte-identical
go build -toolexec 'toolstash -cmp' -a std

8. compilebench

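go test -bench benchmarks application code, not the compiler. compilebench measures the compiler itself, compiling standard library packages and reporting the metrics from the table above. Use -compile to point at each worktree’s compiler binary. In the commands below, $RUN is a regular expression selecting the benchmarks to run and $A is the GOOS_GOARCH directory name under pkg/tool (for example linux_amd64). Run baseline and changed passes sequentially, never in parallel, so they do not compete for CPU and cache.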

# baseline: compile with the clean compiler
compilebench -alloc -count 20 -run "$RUN" -compile "$BEFORE/pkg/tool/$A/compile" > /tmp/old.txt
# experiment: compile with the changed compiler
compilebench -alloc -count 20 -run "$RUN" -compile "$AFTER/pkg/tool/$A/compile"  > /tmp/new.txt

9. benchstat

benchstat compares two sets of benchmark results and reports a p-value for each metric. A p-value below 0.05 means the difference is unlikely to be noise. The more samples (set via -count in compilebench), the more reliable the comparison.

# statistical comparison of the two runs
benchstat /tmp/old.txt /tmp/new.txt

Example output:

                 │  old.txt   │              new.txt              │
                 │  allocs/op │  allocs/op   vs base              │
GoTypes            5.013M ± 0%   4.971M ± 0%  -0.83% (p=0.000)
SSA                47.02M ± 0%   46.50M ± 0%  -1.12% (p=0.000)
Flate              823.9k ± 0%   813.8k ± 0%  -1.23% (p=0.000)
geomean                                       -0.54%
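When the table is long, it helps to pull out only the rows that count as wins. The following is a rough sketch that scans benchstat's text output for rows with a negative delta and p < 0.05. `significant_wins` is a hypothetical helper of mine, and it assumes the row layout shown above. For anything more serious, benchstat's -format csv output is likely the safer thing to parse.

```shell
# significant_wins FILE
# Hypothetical helper: prints "name delta" for every row of a benchstat text
# table that improved (negative delta) with p < 0.05. Rows marked "~" have no
# delta and the geomean row has no p-value, so both are skipped.
significant_wins() {
  awk '
    {
      delta = ""; p = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^[+-][0-9.]+%$/) delta = $i                    # e.g. -0.83%
        if ($i ~ /^\(p=[0-9.]+/) { p = $i; gsub(/[^0-9.]/, "", p) }  # e.g. (p=0.000
      }
      if (delta ~ /^-/ && p != "" && p + 0 < 0.05) print $1, delta
    }
  ' "$1"
}
```

Saving the comparison with `benchstat /tmp/old.txt /tmp/new.txt > /tmp/cmp.txt` and running `significant_wins /tmp/cmp.txt` lists one benchmark per line. Only the allocs/op and B/op blocks are worth filtering this way, for the reasons given in the metrics section.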

What is next

That is the tooling. In the next post, I will put it all to use on a real change and walk through the results.

¹ “operation” means one compilation of a single package by compilebench. Each metric is measured per compilation run.
² Escape analysis determines whether a variable can stay on the stack or must be heap-allocated. The decision is per variable, not per code path. See FAQ: heap vs stack.