[{"content":"","date":null,"permalink":"https://siutsin.com/tags/compiler/","section":"Tags","summary":"","title":"Compiler"},{"content":"","date":null,"permalink":"https://siutsin.com/tags/go/","section":"Tags","summary":"","title":"Go"},{"content":"Recently I started looking at Go compiler performance as part of my work. I work with a large monorepo. Some code is generated, and at that scale, lots of weird edge cases pop up. The underlying reasons do not matter, but it got me curious about what cmd/compile spends its time on and whether any of it is avoidable.\nThe goal: make the compiler faster without changing the code it generates. The output must be byte-identical. If it is not, the change has altered the compiler\u0026rsquo;s output, not just its speed.\nThis is different from application performance work. There, the focus is the code that runs. Here, it is the tool that builds it. Profiling, measuring, and proving safety all work differently.\nTwo questions every change must answer #Every compiler performance change must prove two things:\nSafe: the generated code is byte-identical. Verified with toolstash -cmp over the entire standard library. Better: a statistically significant improvement. Verified with compilebench and benchstat over at least 20 runs. graph TD A[Change the compiler code] --\u003e B{Exactly the same binaries as before?} B -- identical --\u003e C{Statistically better?} B -- differs --\u003e D[Discard: output changed] C -- confirmed --\u003e E[Keep] C -- inconclusive --\u003e F[Discard: no improvement] Which metrics to trust #compilebench with the -alloc flag, summarised by benchstat, produces five metric blocks:\nMetric What it means Deterministic? Trust level allocs/op Number of heap allocations per operation1 Yes Primary evidence. Same input, same count. B/op Bytes of heap memory allocated per operation Yes Primary evidence. Deterministic. sec/op Wall-clock seconds per operation No Only with perflock on a quiet, dedicated box. 
user-sec/op User-space CPU seconds per operation No Directional only. Varies between runs. maxRSS/op Peak resident memory (RSS) per operation No Often an artifact. Small packages hit a floor. allocs/op and B/op come from Go\u0026rsquo;s runtime memory counters. They are a property of the compiler and its input, not the machine. Same compiler, same source file, same count every time.\nsec/op is wall-clock time: the real elapsed seconds from start to finish. It depends on what else is running on the machine, what the CPU frequency is, and whether anything is competing for CPU cache. A 0.5% allocation reduction will often show zero measurable change in sec/op, even on a quiet box. That does not mean the change is pointless. Wall time is a blunt instrument for small, fixed-percentage improvements. Allocation counts are the sharper one.\nuser-sec/op measures CPU time spent in user space (the program itself, excluding time the kernel spends on its behalf for I/O and system calls). It is less affected by background load than sec/op but still varies between runs because of thread scheduling and cache effects. Useful for spotting large regressions, not for proving a small win.\nmaxRSS/op reports the highest amount of physical memory the compiler used during one compilation. For small packages, the OS always reports the same minimum regardless of the change, so the number stops being useful. Small positive deltas are usually noise, not real regressions.\nThe practical upshot: for allocation-focused work, publishable, reproducible numbers can be obtained on a laptop. No server room needed.\nThe workflow #The compiler README lists the official set of helpful tools, including bent for large-scale benchmarks and view-annotated-file for overlaying compiler decisions onto source code. This section walks through my typical workflow and why each tool exists.\n1. 
git worktree #The comparison needs two builds of the compiler side by side: one without the change (baseline) and one with (experiment). Without worktrees, switching branches means rebuilding the entire toolchain (compiler, linker, and other build tools) every time. Two git worktree directories let both builds exist at once.\nThe baseline worktree stays clean so its compiler binary can be used as the reference in benchmarks.\n# create two worktrees from the same commit git worktree add --detach \u0026#34;$BEFORE\u0026#34; HEAD git worktree add --detach \u0026#34;$AFTER\u0026#34; HEAD 2. make.bash #Go builds itself from source. There is no way to patch just the compiler in an existing installation. make.bash (in the src/ directory) produces a runnable bin/go and a compile binary from the source tree. It needs an existing Go installation to bootstrap from, set via GOROOT_BOOTSTRAP.\nSet GOTOOLCHAIN=local to prevent the built toolchain from auto-downloading a different version. Without it, the carefully built compiler gets silently replaced.\n# build the toolchain from source cd \u0026#34;$BEFORE/src\u0026#34; \u0026amp;\u0026amp; GOROOT_BOOTSTRAP=$(go env GOROOT) GOTOOLCHAIN=local ./make.bash 3. perflock #The CPU adjusts its clock speed based on load and thermal conditions, which makes sec/op vary between runs. perflock pins the frequency. A quiet, dedicated box removes remaining cache and scheduler noise. allocs/op and B/op are deterministic regardless, so perflock is optional for allocation work.\n# start the perflock daemon (Linux only) sudo $(which perflock) -daemon 4. compile -bench #The compile binary accepts a -bench flag that prints wall time for each phase (parsing, inlining, escape analysis, SSA backend, object writing). A CPU profile weights all-thread time, which makes the parallel backend look dominant. 
-bench shows the per-phase breakdown.\n# print per-phase wall time for one package compile -bench /dev/stdout -I \u0026#34;$std\u0026#34; -o /dev/null -p \u0026#34;$pkg\u0026#34; \u0026#34;${files[@]}\u0026#34; compile is the binary built by make.bash, $std is a directory of prebuilt standard library imports, $pkg is the package to profile, and ${files[@]} is its source files.\n5. perf stat #On Linux, perf stat wraps the compiler and reports hardware-level metrics: instructions per cycle (IPC), page faults, branch misses, and the split between user and system time.\nIPC measures how much work the CPU does per clock cycle. A value below 1.0 means the CPU is mostly waiting, usually for memory. The Go compiler sets a 128 MB heap goal at startup so that GC does not trigger until the heap reaches that size. This causes the OS to map many memory pages. This shows up as high page-fault counts and system time (time the OS spends handling those page faults, as opposed to user time where the compiler itself runs).\n# hardware counters: IPC, page faults, user/sys split perf stat compile -bench /dev/stdout -I \u0026#34;$std\u0026#34; -o /dev/null -p \u0026#34;$pkg\u0026#34; \u0026#34;${files[@]}\u0026#34; 6. go build -gcflags=-m #Passing -gcflags=-m to go build tells the compiler to print its escape analysis2 decisions: which variables stay on the stack and which escape to the heap. Once an allocation profile points to a function allocating more than expected, this flag shows why.\n# show which variables escape to the heap go build -gcflags=\u0026#39;-m\u0026#39; cmd/compile 2\u0026gt;\u0026amp;1 | grep \u0026#39;escapes to heap\u0026#39; 7. toolstash #After applying a change and rebuilding, the only compiler available is the changed one. toolstash save snapshots the clean compiler binaries before the change. After rebuilding, toolstash -cmp runs both the saved and live compilers on the same input and diffs the output byte for byte. 
Zero output means byte-identical.\nSave before the change, not after. If saved after, the comparison proves nothing.\n# snapshot the clean compiler toolstash save # rebuild with the change go install cmd/compile # zero output = byte-identical go build -toolexec \u0026#39;toolstash -cmp\u0026#39; -a std 8. compilebench #go test -bench benchmarks application code, not the compiler. compilebench measures the compiler itself, compiling standard library packages and reporting the metrics from the table above. Use -compile to point at each worktree\u0026rsquo;s compiler binary. Run baseline and changed passes sequentially, never in parallel.\n# baseline: compile with the clean compiler compilebench -alloc -count 20 -run \u0026#34;$RUN\u0026#34; -compile \u0026#34;$BEFORE/pkg/tool/$A/compile\u0026#34; \u0026gt; /tmp/old.txt # experiment: compile with the changed compiler compilebench -alloc -count 20 -run \u0026#34;$RUN\u0026#34; -compile \u0026#34;$AFTER/pkg/tool/$A/compile\u0026#34; \u0026gt; /tmp/new.txt $BEFORE and $AFTER are the worktree paths from step 1. $A is the platform directory (e.g. linux_amd64). $RUN is the benchmark filter (e.g. Template|Unicode|GoTypes|Compiler|SSA|Flate|GoParser|Reflect|Tar|XML).\n9. benchstat #benchstat compares two sets of benchmark results and reports a p-value for each metric. A p-value below 0.05 means the difference is unlikely to be noise. The more samples (set via -count in compilebench), the more reliable the comparison.\n# statistical comparison of the two runs benchstat /tmp/old.txt /tmp/new.txt Example output:\n│ old.txt │ new.txt │ │ allocs/op │ allocs/op vs base │ GoTypes 5.013M ± 0% 4.971M ± 0% -0.83% (p=0.000) SSA 47.02M ± 0% 46.50M ± 0% -1.12% (p=0.000) Flate 823.9k ± 0% 813.8k ± 0% -1.23% (p=0.000) geomean -0.54% What is next #That is the tooling. 
In the next post, I will put it all to use on a real change and walk through the results.\n\u0026ldquo;operation\u0026rdquo; means one compilation of a single package by compilebench. Each metric is measured per compilation run.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nEscape analysis determines whether a variable can stay on the stack or must be heap-allocated. The decision is per variable, not per code path. See FAQ: heap vs stack.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","date":"Sun, 14 Jun 2026 11:22 +0100","permalink":"https://siutsin.com/posts/go-compiler-perf-tooling/","section":"Posts","summary":"The tools behind Go compiler performance work: why byte-identical output is the safety bar, which metrics to trust, and what each tool does.","title":"Go compiler performance tooling: a step-by-step guide"},{"content":"Writing about whatever I am working on or thinking about. Some of it is technical, some of it is not.\n","date":null,"permalink":"https://siutsin.com/","section":"Over Engineering","summary":"\u003cp\u003eWriting about whatever I am working on or thinking about. Some of it is\ntechnical, some of it is not.\u003c/p\u003e","title":"Over Engineering"},{"content":"","date":null,"permalink":"https://siutsin.com/tags/performance/","section":"Tags","summary":"","title":"Performance"},{"content":"","date":null,"permalink":"https://siutsin.com/posts/","section":"Posts","summary":"","title":"Posts"},{"content":"","date":null,"permalink":"https://siutsin.com/tags/","section":"Tags","summary":"","title":"Tags"},{"content":"","date":null,"permalink":"https://siutsin.com/tags/tooling/","section":"Tags","summary":"","title":"Tooling"}]