CHAPTER 32 · QUALITY & SHIPPING

Benchmarking & Profiling

Benchmarks are first-class citizens in Go, living right next to tests. go test -bench . gives you nanosecond-precision measurements; pprof tells you where time and memory go.

Learning objectives

Write and run benchmarks with testing.B.
Understand b.N and how Go calibrates iteration counts.
Report allocations per operation with b.ReportAllocs.
Compare two benchmark runs with benchstat.
Profile CPU and memory with pprof.

Writing a benchmark

func BenchmarkAdd(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = Add(2, 3)
    }
}

Rules (same as for tests): the file name ends in _test.go, the function is named BenchmarkXxx, and it takes a single *testing.B parameter.

Running benchmarks

go test -bench=. ./...              # run every benchmark
go test -bench=Add . -benchtime=2s  # run for 2 seconds per benchmark
go test -bench=. -count=5 .         # repeat each benchmark 5 times (one result line per run; benchstat aggregates them)

Sample output:

BenchmarkAdd-10    1000000000    0.2570 ns/op
//           ^^    ^^^^^^^^^^    ^^^^^^^^^^^^
//   GOMAXPROCS    b.N (iters)   ns per iteration

Understanding `b.N`

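Go calls your benchmark function repeatedly with increasing b.N values until the timed loop runs for the target duration (1 second by default, adjustable with -benchtime), then reports the elapsed time divided by b.N. Because the function is called more than once, any setup in it also runs more than once. Your job: perform exactly one operation you want to measure per iteration, and keep setup work outside the loop: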

func BenchmarkJoin(b *testing.B) {
    words := strings.Split(strings.Repeat("hello ", 1000), " ")   // setup outside
    b.ResetTimer()                                                 // discount setup

    for i := 0; i < b.N; i++ {
        _ = strings.Join(words, "-")
    }
}
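
To watch the calibration happen, here is a minimal sketch that drives the same benchmark through testing.Benchmark, which runs a benchmark function outside of go test. The calls counter, the joined variable, and the calibrate helper are names invented for this illustration; they are not part of the testing package.

```go
package main

import (
	"strings"
	"testing"
)

// calls counts how many times the benchmark function itself is invoked
// (hypothetical helper variable, only for this demonstration).
var calls int

// joined keeps the last result reachable so the work is not discarded.
var joined string

func BenchmarkJoin(b *testing.B) {
	calls++ // incremented once per invocation, i.e. once per b.N the framework tries
	words := strings.Split(strings.Repeat("hello ", 1000), " ") // setup outside the loop
	b.ResetTimer()                                               // discount setup

	for i := 0; i < b.N; i++ {
		joined = strings.Join(words, "-")
	}
}

// calibrate runs BenchmarkJoin via testing.Benchmark and returns the final
// b.N, the measured ns/op, and how many times the function was invoked.
func calibrate() (finalN int, nsPerOp int64, invocations int) {
	calls = 0
	res := testing.Benchmark(BenchmarkJoin)
	return res.N, res.NsPerOp(), calls
}
```

Calling calibrate() and printing its results should show the function being invoked several times, with only the final, largest b.N reported. The exact numbers depend on your machine.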

Reporting allocations

func BenchmarkStringConcat(b *testing.B) {
    b.ReportAllocs() // add B/op and allocs/op columns for this benchmark
    for i := 0; i < b.N; i++ {
        s := ""
        for j := 0; j < 100; j++ {
            s += "x" // each += builds a new string
        }
        _ = s
    }
}

Output now includes bytes-per-op and allocs-per-op (passing -benchmem on the command line does the same for every benchmark):

BenchmarkStringConcat-10   100000   12000 ns/op   5200 B/op   100 allocs/op

Allocations are often where the time goes: each one costs CPU up front and adds garbage-collector work later. Cutting allocs/op is frequently the most effective first optimization.
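
As a rough illustration of how much a small change can move the allocation count, the sketch below uses testing.AllocsPerRun to compare appending into an empty slice against appending into one with capacity reserved up front. The function names fillGrow, fillPrealloc, and allocCounts are made up for this example, and the package-level sink is there so the slices escape to the heap and get counted.

```go
package main

import "testing"

// sink is package-level so the slices escape to the heap and are counted.
var sink []int

// fillGrow appends into a nil slice, so the runtime reallocates as it grows.
func fillGrow(n int) {
	var s []int
	for j := 0; j < n; j++ {
		s = append(s, j)
	}
	sink = s
}

// fillPrealloc reserves the full capacity once, then appends into it.
func fillPrealloc(n int) {
	s := make([]int, 0, n)
	for j := 0; j < n; j++ {
		s = append(s, j)
	}
	sink = s
}

// allocCounts returns the average number of heap allocations per call
// for each variant, as measured by testing.AllocsPerRun.
func allocCounts(n int) (grow, prealloc float64) {
	grow = testing.AllocsPerRun(100, func() { fillGrow(n) })
	prealloc = testing.AllocsPerRun(100, func() { fillPrealloc(n) })
	return grow, prealloc
}
```

With n = 1000, the growing variant should report several allocations per call and the preallocated one far fewer. The same comparison can be written as two benchmarks with b.ReportAllocs().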

Sub-benchmarks

func BenchmarkAppend(b *testing.B) {
    for _, size := range []int{10, 100, 1000, 10000} {
        b.Run(fmt.Sprintf("size-%d", size), func(b *testing.B) {
            for i := 0; i < b.N; i++ {
                var s []int // start empty so append has to grow the slice
                for j := 0; j < size; j++ {
                    s = append(s, j)
                }
            }
        })
    }
}

`benchstat`: comparing runs

go install golang.org/x/perf/cmd/benchstat@latest

# before
go test -bench=. -count=10 . > before.txt

# make your change, then:
go test -bench=. -count=10 . > after.txt

benchstat before.txt after.txt

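benchstat summarizes each benchmark across its runs (a central value plus a variation range), shows the percentage change between the two files, and reports a p-value so you can tell a real difference from noise. A delta shown as ~ means no statistically significant change was detected. Essential for real performance work.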

`pprof`

When a benchmark tells you something's slow, pprof tells you where:

# Capture CPU profile during benchmark
go test -bench=Work -cpuprofile=cpu.prof .

# Interactive explorer
go tool pprof cpu.prof
# > top
# > list FuncName
# > web          (opens a graph in your browser; requires Graphviz)

# Memory
go test -bench=Work -memprofile=mem.prof .
go tool pprof mem.prof

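For a running HTTP server, add a blank import of net/http/pprof (import _ "net/http/pprof"); its init function registers handlers under /debug/pprof/ on http.DefaultServeMux. Then fetch a CPU profile (30 seconds by default) with go tool pprof http://localhost:8080/debug/pprof/profile.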

Benchmarking tips

Close other apps. Background noise skews results.
Let it warm up. The first iterations are often slower (cold caches, CPU frequency scaling); -benchtime=2s or longer averages that out.
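Prevent dead-code elimination. Store results in a package-level variable. Writing _ = result only silences the "unused" compile error; the compiler can still inline the call and discard its work, which tends to show up as implausibly small ns/op figures.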
One thing at a time. Don't change code + env + tooling between before/after runs.
Measure, don't guess. Your intuition about perf is wrong more often than you'd like.
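
The dead-code tip can be sketched as follows. Add is assumed to be a trivial two-argument function like the one benchmarked earlier, and x, y, result, and BenchmarkAddSink are names chosen for this example. Reading inputs from package-level variables and writing the output to one makes it much harder for the compiler to fold the call into a constant or drop it, though no pattern is a guarantee across compiler versions.

```go
package main

import "testing"

// Add stands in for the function under test (assumed for this example).
func Add(a, b int) int { return a + b }

var (
	// x and y are package-level inputs, so the call is not a compile-time constant.
	x, y = 2, 3
	// result is a package-level sink; storing into it is an observable side effect.
	result int
)

func BenchmarkAddSink(b *testing.B) {
	var r int
	for i := 0; i < b.N; i++ {
		r = Add(x, y)
	}
	result = r // publish the value so the loop's work cannot be treated as unused
}
```

If a benchmark still reports well under a nanosecond per operation for work you know is not trivial, inspect the generated code or the CPU profile before trusting the number.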

Practice exercises

EXERCISE 1

Concat vs Builder

Write two benchmarks: one that builds a 1000-character string with +=, another with strings.Builder. Enable b.ReportAllocs() and compare.

Solution

var sink string // package-level, so the results stay observable

func BenchmarkConcat(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        s := ""
        for j := 0; j < 1000; j++ {
            s += "x"
        }
        sink = s
    }
}

func BenchmarkBuilder(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        var bd strings.Builder
        for j := 0; j < 1000; j++ {
            bd.WriteByte('x')
        }
        sink = bd.String()
    }
}

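Expected: the Builder version is typically one to two orders of magnitude faster and makes far fewer allocations (roughly one per += for concatenation versus a handful of buffer growths for the Builder). Exact numbers depend on your machine and Go version.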