
Benchmarking — Measuring What Matters 🟡
基准测试:衡量真正重要的东西 🟡

What you’ll learn:
本章将学到什么:

  • Why naive timing with Instant::now() produces unreliable results
    为什么拿 Instant::now() 直接计时,结果往往靠不住
  • Statistical benchmarking with Criterion.rs and the lighter Divan alternative
    如何用 Criterion.rs 做统计学意义上的基准测试,以及更轻量的 Divan 替代方案
  • Profiling hot spots with perf, flamegraphs, and PGO
    如何用 perf、火焰图和 PGO 分析热点
  • Setting up continuous benchmarking in CI to catch regressions automatically
    如何在 CI 里持续跑基准测试,自动抓性能回退

Cross-references: Release Profiles — once you find the hot spot, optimize the binary · CI/CD Pipeline — benchmark job in the pipeline · Code Coverage — coverage tells you what’s tested, benchmarks tell you what’s fast
交叉阅读: 发布配置 负责在找到热点之后继续压性能;CI/CD 流水线 会把 benchmark 任务放进流水线;代码覆盖率 讲的是“哪里测到了”,基准测试讲的是“哪里快、哪里慢”。

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.” — Donald Knuth
“大约 97% 的时候,都应该忘掉那些细枝末节的小效率问题;过早优化是万恶之源。但那关键的 3%,又绝不能放过。”—— Donald Knuth

The hard part isn’t writing benchmarks — it’s writing benchmarks that produce meaningful, reproducible, actionable numbers. This chapter covers the tools and techniques that get you from “it seems fast” to “we have statistical evidence that PR #347 regressed parsing throughput by 4.2%.”
真正难的不是把 benchmark 写出来,而是写出 有意义、可复现、能指导行动 的 benchmark。本章要解决的,就是怎么从“感觉好像挺快”走到“已经有统计证据表明 PR #347 让解析吞吐下降了 4.2%”。

Why Not std::time::Instant?
为什么不能只靠 std::time::Instant

The temptation:
很多人一开始都很容易这么写:

// ❌ Naive benchmarking — unreliable results
use std::time::Instant;

fn main() {
    // Assume sample_data and parse_device_query_output are defined elsewhere.
    let start = Instant::now();
    let result = parse_device_query_output(&sample_data);
    let elapsed = start.elapsed();
    println!("Parsing took {:?}", elapsed);
    // Problem 1: Compiler may optimize away `result` (dead code elimination)
    // Problem 2: Single sample — no statistical significance
    // Problem 3: CPU frequency scaling, thermal throttling, other processes
    // Problem 4: Cold cache vs warm cache not controlled
}

Problems with manual timing:
手工计时的问题主要有这些:

  1. Dead code elimination — the compiler may skip the computation entirely if the result isn’t used.
    1. 死代码消除:如果结果没真正参与后续逻辑,编译器可能直接把计算优化没了。
  2. No warm-up — the first run includes cache misses, page faults, and lazy initialization noise.
    2. 没有预热:第一次运行通常混着缓存未命中、页错误和延迟初始化噪音。
  3. No statistical analysis — a single measurement tells you nothing about variance, outliers, or confidence intervals.
    3. 没有统计分析:单次测量几乎说明不了方差、异常值和置信区间。
  4. No regression detection — you can’t compare against previous runs in a stable way.
    4. 无法稳定识别回退:没法和历史结果做可靠对比。
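If you still need a quick manual number, a hand-rolled loop can at least mitigate problems 1–3. Below is a minimal stdlib-only sketch; the `checksum` workload is a hypothetical stand-in for your function under test:

```rust
use std::hint::black_box;
use std::time::Instant;

/// Illustrative workload — any pure function you want to time.
fn checksum(data: &[u64]) -> u64 {
    data.iter().fold(0u64, |acc, x| acc.wrapping_add(x.rotate_left(7)))
}

fn main() {
    let data: Vec<u64> = (0..10_000).collect();

    // Warm-up pass: populate caches, trigger lazy init (problem 2).
    for _ in 0..10 {
        black_box(checksum(black_box(&data)));
    }

    // Take many samples instead of one (problem 3).
    let mut samples: Vec<u128> = (0..100)
        .map(|_| {
            let start = Instant::now();
            // black_box defeats dead-code elimination (problem 1).
            black_box(checksum(black_box(&data)));
            start.elapsed().as_nanos()
        })
        .collect();

    samples.sort_unstable();
    let median = samples[samples.len() / 2];
    println!("median over {} samples: {median} ns", samples.len());
}
```

Even so, this gives no outlier rejection, no confidence intervals, and no cross-run regression comparison; that is exactly what a framework like Criterion adds.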

Criterion.rs — Statistical Benchmarking
Criterion.rs:统计学基准测试

Criterion.rs is the de facto standard for Rust micro-benchmarks. It uses statistical methods to produce reliable measurements and detects performance regressions automatically.
Criterion.rs 基本上就是 Rust 微基准测试的事实标准。它会通过统计方法生成更可靠的测量结果,还能自动识别性能回退。

Setup:
基本配置:

# Cargo.toml
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports", "cargo_bench_support"] }

[[bench]]
name = "parsing_bench"
harness = false  # Use Criterion's harness, not the built-in test harness

A complete benchmark:
一个完整的 benchmark:

// benches/parsing_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};

/// Data type for parsed GPU information
#[derive(Debug, Clone)]
struct GpuInfo {
    index: u32,
    name: String,
    temp_c: u32,
    power_w: f64,
}

/// The function under test — simulate parsing device-query CSV output
fn parse_gpu_csv(input: &str) -> Vec<GpuInfo> {
    input
        .lines()
        .filter(|line| !line.starts_with('#'))
        .filter_map(|line| {
            let fields: Vec<&str> = line.split(", ").collect();
            if fields.len() >= 4 {
                Some(GpuInfo {
                    index: fields[0].parse().ok()?,
                    name: fields[1].to_string(),
                    temp_c: fields[2].parse().ok()?,
                    power_w: fields[3].parse().ok()?,
                })
            } else {
                None
            }
        })
        .collect()
}

fn bench_parse_gpu_csv(c: &mut Criterion) {
    // Representative test data
    let small_input = "0, Acme Accel-V1-80GB, 32, 65.5\n\
                       1, Acme Accel-V1-80GB, 34, 67.2\n";

    let large_input = (0..64)
        .map(|i| format!("{i}, Acme Accel-X1-80GB, {}, {:.1}\n", 30 + i % 20, 60.0 + i as f64))
        .collect::<String>();

    c.bench_function("parse_2_gpus", |b| {
        b.iter(|| parse_gpu_csv(black_box(small_input)))
    });

    c.bench_function("parse_64_gpus", |b| {
        b.iter(|| parse_gpu_csv(black_box(&large_input)))
    });
}

criterion_group!(benches, bench_parse_gpu_csv);
criterion_main!(benches);

Running and reading results:
运行方式和结果解读:

# Run all benchmarks
cargo bench

# Run a specific benchmark by name
cargo bench -- parse_64

# Output:
# parse_2_gpus        time:   [1.2345 µs 1.2456 µs 1.2578 µs]
#                             lower bound / point estimate / upper bound
#                             of the 95% confidence interval
#
# parse_64_gpus       time:   [38.123 µs 38.456 µs 38.812 µs]
#                     change: [-1.2345% -0.5678% +0.1234%] (p = 0.12 > 0.05)
#                     No change in performance detected.

What black_box() does: it is an optimization barrier that prevents dead-code elimination and over-aggressive constant folding. The compiler cannot see through black_box, so it must actually perform the computation. (Since Rust 1.66 the standard library provides std::hint::black_box, which serves the same purpose.)
black_box() 是干什么的:它相当于一道优化屏障,阻止死代码消除和过度的常量折叠。编译器没法看穿 black_box,因此必须老老实实把计算做完。(Rust 1.66 起,标准库也提供了作用相同的 std::hint::black_box。)
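To see the barrier in action, here is a tiny stdlib-only sketch (the function name `sum_to` is illustrative):

```rust
use std::hint::black_box;

/// A trivially foldable computation: the sum of 1..=n.
fn sum_to(n: u64) -> u64 {
    (1..=n).sum()
}

fn main() {
    // Without black_box, the optimizer can evaluate this at compile time
    // and replace the call with the constant 500500, so a timing loop
    // around it would measure an empty body.
    let folded = sum_to(1_000);

    // With black_box, the argument is opaque to the optimizer, so the
    // summation must actually run. This is why benchmark closures wrap
    // their inputs (and often their outputs) in black_box.
    let measured = sum_to(black_box(1_000));

    assert_eq!(folded, measured);
    println!("sum_to(1000) = {measured}");
}
```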

Parameterized Benchmarks and Benchmark Groups
参数化 benchmark 与分组测试

Compare multiple implementations or input sizes:
如果想比较不同实现,或者比较不同输入规模,就可以用参数化 benchmark。

// benches/comparison_bench.rs
use criterion::{criterion_group, criterion_main, Criterion, BenchmarkId, Throughput};

// Assumes generate_gpu_csv, parse_split, parse_regex, and parse_nom are
// defined elsewhere in the crate.

fn bench_parsing_strategies(c: &mut Criterion) {
    let mut group = c.benchmark_group("csv_parsing");

    // Test across different input sizes
    for num_gpus in [1, 8, 32, 64, 128] {
        let input = generate_gpu_csv(num_gpus);

        // Set throughput for bytes-per-second reporting
        group.throughput(Throughput::Bytes(input.len() as u64));

        group.bench_with_input(
            BenchmarkId::new("split_based", num_gpus),
            &input,
            |b, input| b.iter(|| parse_split(input)),
        );

        group.bench_with_input(
            BenchmarkId::new("regex_based", num_gpus),
            &input,
            |b, input| b.iter(|| parse_regex(input)),
        );

        group.bench_with_input(
            BenchmarkId::new("nom_based", num_gpus),
            &input,
            |b, input| b.iter(|| parse_nom(input)),
        );
    }
    group.finish();
}

criterion_group!(benches, bench_parsing_strategies);
criterion_main!(benches);

Output: Criterion generates an HTML report at target/criterion/report/index.html with violin plots, comparison charts, and regression analysis.
输出结果:Criterion 会在 target/criterion/report/index.html 生成 HTML 报告,里面有小提琴图、对比图和回归分析,浏览器里看非常直观。
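The group above assumes a data generator and three parser variants defined elsewhere in the crate. A self-contained sketch of the generator and the dependency-free split_based strategy (parse_regex and parse_nom would build on the regex and nom crates, respectively) might look like this:

```rust
/// Generate n CSV rows in the accelerator-query format used above.
fn generate_gpu_csv(n: usize) -> String {
    (0..n)
        .map(|i| format!("{i}, Acme Accel-X1-80GB, {}, {:.1}\n", 30 + i % 20, 60.0 + i as f64))
        .collect()
}

/// "split_based" strategy: plain str::split, no extra dependencies.
/// Returns (index, name, temp_c, power_w) tuples.
fn parse_split(input: &str) -> Vec<(u32, String, u32, f64)> {
    input
        .lines()
        .filter_map(|line| {
            let f: Vec<&str> = line.split(", ").collect();
            if f.len() >= 4 {
                Some((f[0].parse().ok()?, f[1].to_string(), f[2].parse().ok()?, f[3].parse().ok()?))
            } else {
                None
            }
        })
        .collect()
}

fn main() {
    let csv = generate_gpu_csv(8);
    let rows = parse_split(&csv);
    assert_eq!(rows.len(), 8);
    println!("parsed {} rows", rows.len());
}
```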

Divan — A Lighter Alternative
Divan:更轻量的替代方案

Divan is a newer benchmarking framework that uses attribute macros instead of Criterion’s macro DSL:
Divan 是一个更新、更轻的 benchmark 框架,它主要靠 attribute macro,而不是 Criterion 那一套宏 DSL。

# Cargo.toml
[dev-dependencies]
divan = "0.1"

[[bench]]
name = "parsing_bench"
harness = false

// benches/parsing_bench.rs
use divan::black_box;

// GpuInfo and parse_gpu_csv are the same type and function as in the
// Criterion example above.

const SMALL_INPUT: &str = "0, Acme Accel-V1-80GB, 32, 65.5\n\
                          1, Acme Accel-V1-80GB, 34, 67.2\n";

fn generate_gpu_csv(n: usize) -> String {
    (0..n)
        .map(|i| format!("{i}, Acme Accel-X1-80GB, {}, {:.1}\n", 30 + i % 20, 60.0 + i as f64))
        .collect()
}

fn main() {
    divan::main();
}

#[divan::bench]
fn parse_2_gpus() -> Vec<GpuInfo> {
    parse_gpu_csv(black_box(SMALL_INPUT))
}

#[divan::bench(args = [1, 8, 32, 64, 128])]
fn parse_n_gpus(n: usize) -> Vec<GpuInfo> {
    // Note: input generation runs inside the timed body here; to exclude
    // setup cost, use divan::Bencher::with_inputs instead.
    let input = generate_gpu_csv(n);
    parse_gpu_csv(black_box(&input))
}

// Divan output is a clean table:
// ╰─ parse_2_gpus   fastest  │ slowest  │ median   │ mean     │ samples │ iters
//                   1.234 µs │ 1.567 µs │ 1.345 µs │ 1.350 µs │ 100     │ 1600

When to choose Divan over Criterion:
什么时候选 Divan:

  • Simpler API (attribute macros, less boilerplate)
    API 更简单,样板代码更少。
  • Faster compilation (fewer dependencies)
    依赖更少,编译更快。
  • Good for quick perf checks during development
    适合开发过程里的快速性能检查。

When to choose Criterion:
什么时候选 Criterion:

  • Statistical regression detection across runs
    需要跨运行做统计学回归分析。
  • HTML reports with charts
    需要图表化 HTML 报告。
  • Established ecosystem, more CI integrations
    生态更成熟,CI 集成也更多。

Profiling with perf and Flamegraphs
perf 和火焰图做性能剖析

Benchmarks tell you how fast — profiling tells you where the time goes.
benchmark 告诉你的是“有多快”,profiling 告诉你的是“时间到底花在哪”。

# Step 1: Build with debug info (release speed, debug symbols)
cargo build --release
# Ensure debug info is available:
# [profile.release]
# debug = true          # Add this temporarily for profiling

# Step 2: Record with perf
perf record --call-graph=dwarf ./target/release/diag_tool --run-diagnostics

# Step 3: Generate a flamegraph
# Install: cargo install flamegraph
# Optional: cargo install addr2line --features=bin  (speeds up symbol resolution)
cargo flamegraph --root -- --run-diagnostics
# Opens an interactive SVG flamegraph

# Alternative: use perf + inferno
perf script | inferno-collapse-perf | inferno-flamegraph > flamegraph.svg

Reading a flamegraph:
火焰图怎么看:

  • Width = time spent in that function
    宽度越大,说明函数耗时越多。
  • Height = call stack depth
    高度表示调用栈深度,本身不等于更慢。
  • Bottom = entry point, Top = leaf functions doing actual work
    底部是入口,顶部通常是真正干活的叶子函数。
  • Look for wide plateaus at the top — those are your hot spots
    盯着顶部那些又宽又平的块看,热点大概率就在那里。

Profile-guided optimization (PGO):
基于 profile 的优化(PGO):

# Step 1: Build with instrumentation
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release

# Step 2: Run representative workloads
./target/release/diag_tool --run-full   # generates profiling data

# Step 3: Merge profiling data
# Use the llvm-profdata that matches rustc's LLVM version:
# $(rustc --print sysroot)/lib/rustlib/x86_64-unknown-linux-gnu/bin/llvm-profdata
# Or if llvm-tools is installed: rustup component add llvm-tools
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data/

# Step 4: Rebuild with profiling feedback
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release
# Typical improvement: 5-20% for compute-bound code (parsing, crypto, codegen).
# I/O-bound or syscall-heavy code will see much less benefit.

Tip: Before spending time on PGO, ensure your release profile already has LTO enabled — it typically delivers a bigger win for less effort.
建议:在 PGO 上头之前,先确认 release profile 里的 LTO 已经开起来了。很多时候 LTO 的收益更大,成本还更低。
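For reference, a minimal sketch of such a profile (values are illustrative; measure before committing to them):

```toml
# Cargo.toml
[profile.release]
lto = "thin"       # cheap cross-crate inlining; `lto = true` (fat LTO) optimizes harder
codegen-units = 1  # single codegen unit: better optimization, slower compile
```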

hyperfine — Quick End-to-End Timing
hyperfine:快速端到端计时

hyperfine benchmarks whole commands rather than individual functions. It is perfect for measuring overall binary performance:
hyperfine 测的是整条命令,而不是单个函数。所以它特别适合看二进制整体执行性能。

# Install
cargo install hyperfine
# Or: sudo apt install hyperfine  (Ubuntu 23.04+)

# Basic benchmark
hyperfine './target/release/diag_tool --run-diagnostics'

# Compare two implementations
hyperfine './target/release/diag_tool_v1 --run-diagnostics' \
          './target/release/diag_tool_v2 --run-diagnostics'

# Warm-up runs + minimum iterations
hyperfine --warmup 3 --min-runs 10 './target/release/diag_tool --run-all'

# Export results as JSON for CI comparison
hyperfine --export-json bench.json './target/release/diag_tool --run-all'

When to use hyperfine vs Criterion:
hyperfine 和 Criterion 各自适合什么:

  • hyperfine: whole-binary timing, before/after refactor comparisons, I/O-heavy workloads
    hyperfine:测整机耗时,适合重构前后对比、也适合 IO 偏重的任务。
  • Criterion: individual functions, micro-benchmarks, statistical regression detection
    Criterion:测单函数和微基准,更适合做统计学回归检测。

Continuous Benchmarking in CI
在 CI 里持续跑 benchmark

Detect performance regressions before they ship:
把性能回退挡在发版之前。

# .github/workflows/bench.yml
name: Benchmarks

on:
  pull_request:
    paths: ['**/*.rs', 'Cargo.toml', 'Cargo.lock']

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: dtolnay/rust-toolchain@stable

      - name: Run benchmarks
        # Requires criterion = { features = ["cargo_bench_support"] } for --output-format
        run: cargo bench -- --output-format bencher | tee bench_output.txt

      - name: Store benchmark result
        uses: benchmark-action/github-action-benchmark@v1
        with:
          tool: 'cargo'
          output-file-path: bench_output.txt
          github-token: ${{ secrets.GITHUB_TOKEN }}
          auto-push: true
          alert-threshold: '120%'    # Alert if 20% slower
          comment-on-alert: true
          fail-on-alert: true        # Block PR if regression detected

Key CI considerations:
CI 里跑 benchmark 要注意:

  • Use dedicated benchmark runners for consistent results
    最好用专门的 runner,否则噪音很大。
  • Pin the runner to a specific machine type if using cloud CI
    云上 CI 尽量锁定机型。
  • Store historical data to detect gradual regressions
    保存历史数据,方便发现缓慢恶化。
  • Set thresholds based on workload tolerance
    阈值别瞎定,得按业务容忍度来。

Application: Parsing Performance
应用场景:解析性能

The project has several performance-sensitive parsing paths that would benefit from benchmarks:
当前工程里有几条对性能很敏感的解析路径,很适合优先补 benchmark。

  • accelerator-query CSV/XML output (crate: device_diag) — called per-GPU, up to 8× per run
    accelerator-query 的 CSV/XML 输出(device_diag):每张 GPU 都要调用,单次运行最多重复 8 次。
  • Sensor event parsing (event_log) — thousands of records on busy servers
    传感器事件解析(event_log):繁忙服务器上动辄上千条记录。
  • PCIe topology JSON (topology_lib) — complex nested structures, golden-file validated
    PCIe 拓扑 JSON(topology_lib):结构复杂、嵌套深,且已有 golden file 校验数据。
  • Report JSON serialization (diag_framework) — final report output, size-sensitive
    报告 JSON 序列化(diag_framework):最终报告输出,对体积和耗时都敏感。
  • Config JSON loading (config_loader) — startup latency
    配置 JSON 加载(config_loader):直接影响启动延迟。

Recommended first benchmark — the topology parser, which already has golden-file test data:
最推荐先做的 benchmark 是拓扑解析器,因为它已经有现成的 golden file 测试数据。

// topology_lib/benches/parse_bench.rs (proposed)
use criterion::{criterion_group, criterion_main, Criterion, Throughput};
use std::fs;

fn bench_topology_parse(c: &mut Criterion) {
    let mut group = c.benchmark_group("topology_parse");

    for golden_file in ["S2001", "S1015", "S1035", "S1080"] {
        let path = format!("tests/test_data/{golden_file}.json");
        let data = fs::read_to_string(&path).expect("golden file not found");
        group.throughput(Throughput::Bytes(data.len() as u64));

        group.bench_function(golden_file, |b| {
            b.iter(|| {
                topology_lib::TopologyProfile::from_json_str(
                    criterion::black_box(&data)
                )
            });
        });
    }
    group.finish();
}

criterion_group!(benches, bench_topology_parse);
criterion_main!(benches);

Try It Yourself
动手试一试

  1. Write a Criterion benchmark: Pick any parsing function in your codebase. Create a benches/ directory, set up a Criterion benchmark that measures throughput in bytes/second. Run cargo bench and examine the HTML report.
    写一个 Criterion benchmark:在代码库里随便挑一个解析函数,新建 benches/ 目录,做一个能统计 bytes/s 吞吐的 benchmark,跑 cargo bench,再打开 HTML 报告看看。

  2. Generate a flamegraph: Build your project with debug = true in [profile.release], then run cargo flamegraph -- <your-args>. Identify the three widest stacks at the top of the flamegraph.
    生成一张火焰图:在 [profile.release] 里临时加上 debug = true,然后运行 cargo flamegraph -- <参数>,找出顶部最宽的三个调用栈。

  3. Compare with hyperfine: Install hyperfine and benchmark the overall execution time of your binary with different flags. Compare it to the per-function times from Criterion. Where does the time go that Criterion doesn’t see?
    再和 hyperfine 对比:安装 hyperfine,分别测不同参数下的整机耗时,再和 Criterion 的函数级耗时对照。注意那些 Criterion 看不到、但整机时间里确实存在的部分,例如 IO、系统调用和进程启动。

Benchmark Tool Selection
基准测试工具选择

flowchart TD
    START["Want to measure performance?<br/>想测性能吗?"] --> WHAT{"What level?<br/>测哪个层次?"}
    
    WHAT -->|"Single function<br/>单个函数"| CRITERION["Criterion.rs<br/>Statistical, regression detection<br/>统计分析 + 回归检测"]
    WHAT -->|"Quick function check<br/>快速函数检查"| DIVAN["Divan<br/>Lighter, attribute macros<br/>更轻量"]
    WHAT -->|"Whole binary<br/>整个二进制"| HYPERFINE["hyperfine<br/>End-to-end, wall-clock<br/>端到端计时"]
    WHAT -->|"Find hot spots<br/>找热点"| PERF["perf + flamegraph<br/>CPU sampling profiler<br/>采样剖析"]
    
    CRITERION --> CI_BENCH["Continuous benchmarking<br/>in GitHub Actions<br/>持续基准测试"]
    PERF --> OPTIMIZE["Profile-Guided<br/>Optimization (PGO)<br/>PGO 优化"]
    
    style CRITERION fill:#91e5a3,color:#000
    style DIVAN fill:#91e5a3,color:#000
    style HYPERFINE fill:#e3f2fd,color:#000
    style PERF fill:#ffd43b,color:#000
    style CI_BENCH fill:#e3f2fd,color:#000
    style OPTIMIZE fill:#ffd43b,color:#000

🏋️ Exercises
🏋️ 练习

🟢 Exercise 1: First Criterion Benchmark
🟢 练习 1:第一份 Criterion benchmark

Create a crate with a function that sorts a Vec<u64> of 10,000 random elements. Write a Criterion benchmark for .sort(), add a second one for .sort_unstable(), and compare the two in the HTML report.
创建一个 crate,写一个函数去排序 10,000 个随机 u64。先给 .sort() 做一个 Criterion benchmark,再给 .sort_unstable() 加一个,在 HTML 报告里对比两者的性能差异。

Solution 参考答案

# Cargo.toml
[[bench]]
name = "sort_bench"
harness = false

[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }
rand = "0.8"

// benches/sort_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rand::Rng;

fn generate_data(n: usize) -> Vec<u64> {
    let mut rng = rand::thread_rng();
    (0..n).map(|_| rng.gen()).collect()
}

fn bench_sort(c: &mut Criterion) {
    let mut group = c.benchmark_group("sort-10k");

    group.bench_function("stable", |b| {
        b.iter_batched(
            || generate_data(10_000),
            |mut data| { data.sort(); black_box(&data); },
            criterion::BatchSize::SmallInput,
        )
    });

    group.bench_function("unstable", |b| {
        b.iter_batched(
            || generate_data(10_000),
            |mut data| { data.sort_unstable(); black_box(&data); },
            criterion::BatchSize::SmallInput,
        )
    });

    group.finish();
}

criterion_group!(benches, bench_sort);
criterion_main!(benches);

# Run and open the report
cargo bench
open target/criterion/sort-10k/report/index.html

🟡 Exercise 2: Flamegraph Hot Spot
🟡 练习 2:火焰图热点分析

Build a project with debug = true in [profile.release], then generate a flamegraph. Identify the top 3 widest stacks.
[profile.release] 里加 debug = true,重新构建项目并生成火焰图,再找出最宽的三个调用栈。

Solution 参考答案

# Cargo.toml
[profile.release]
debug = true  # Keep symbols for flamegraph

# Build and profile (cargo flamegraph uses the release profile by default)
cargo install flamegraph
cargo flamegraph -- <your-args>
# Writes flamegraph.svg; the widest stacks at the top are your hot spots

Key Takeaways
本章要点

  • Never benchmark with Instant::now() — use Criterion.rs for statistical rigor and regression detection
    别再拿 Instant::now() 当正式 benchmark 了,Criterion 才能提供更像样的统计结果和回归检测。
  • black_box() prevents the compiler from optimizing away your benchmark target
    black_box() 的任务就是防止编译器把被测逻辑直接优化掉。
  • hyperfine measures wall-clock time for the whole binary; Criterion measures individual functions — use both
    hyperfine 测整机耗时,Criterion 测函数级性能,两者最好配合使用。
  • Flamegraphs show where time is spent; benchmarks show how much time is spent
    火焰图负责告诉位置,benchmark 负责告诉量级。
  • Continuous benchmarking in CI catches performance regressions before they ship
    把 benchmark 放进 CI,很多性能回退在合入前就能被逮住。