
Benchmarking — Measuring What Matters 🟡
基准测试:衡量真正重要的东西 🟡

What you’ll learn:
本章将学到什么:

  • Why naive timing with Instant::now() produces unreliable results
    为什么拿 Instant::now() 直接计时,结果往往靠不住
  • Statistical benchmarking with Criterion.rs and the lighter Divan alternative
    如何用 Criterion.rs 做统计学意义上的基准测试,以及更轻量的 Divan 替代方案
  • Profiling hot spots with perf, flamegraphs, and PGO
    如何用 perf、火焰图和 PGO 分析热点
  • Setting up continuous benchmarking in CI to catch regressions automatically
    如何在 CI 里持续跑基准测试,自动抓性能回退

Cross-references: Release Profiles — once you find the hot spot, optimize the binary · CI/CD Pipeline — benchmark job in the pipeline · Code Coverage — coverage tells you what’s tested, benchmarks tell you what’s fast
交叉阅读: 发布配置 负责在找到热点之后继续压性能;CI/CD 流水线 会把 benchmark 任务放进流水线;代码覆盖率 讲的是“哪里测到了”,基准测试讲的是“哪里快、哪里慢”。

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.” — Donald Knuth
“大约 97% 的时候,都应该忘掉那些细枝末节的小效率问题;过早优化是万恶之源。但那关键的 3%,又绝不能放过。”—— Donald Knuth

The hard part isn’t writing benchmarks — it’s writing benchmarks that produce meaningful, reproducible, actionable numbers. This chapter covers the tools and techniques that get you from “it seems fast” to “we have statistical evidence that PR #347 regressed parsing throughput by 4.2%.”
真正难的不是把 benchmark 写出来,而是写出 有意义、可复现、能指导行动 的 benchmark。本章要解决的,就是怎么从“感觉好像挺快”走到“已经有统计证据表明 PR #347 让解析吞吐下降了 4.2%”。

Why Not std::time::Instant?
为什么不能只靠 std::time::Instant

The temptation:
很多人一开始都很容易这么写:

// ❌ Naive benchmarking — unreliable results
use std::time::Instant;

fn main() {
    // Assume sample_data and parse_device_query_output are defined elsewhere.
    let start = Instant::now();
    let result = parse_device_query_output(&sample_data);
    let elapsed = start.elapsed();
    println!("Parsing took {:?}", elapsed);
    // Problem 1: Compiler may optimize away `result` (dead code elimination)
    // Problem 2: Single sample — no statistical significance
    // Problem 3: CPU frequency scaling, thermal throttling, other processes
    // Problem 4: Cold cache vs warm cache not controlled
}

Problems with manual timing:
手工计时的问题主要有这些:

  1. Dead code elimination — the compiler may skip the computation entirely if the result isn’t used.
    1. 死代码消除:如果结果没真正参与后续逻辑,编译器可能直接把计算优化没了。
  2. No warm-up — the first run includes cache misses, page faults, and lazy initialization noise.
    2. 没有预热:第一次运行通常混着缓存未命中、页错误和延迟初始化噪音。
  3. No statistical analysis — a single measurement tells you nothing about variance, outliers, or confidence intervals.
    3. 没有统计分析:单次测量几乎说明不了方差、异常值和置信区间。
  4. No regression detection — you can’t compare against previous runs in a stable way.
    4. 无法稳定识别回退:没法和历史结果做可靠对比。
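If you still need a quick manual number, a hand-rolled loop can at least mitigate problems 1–3. Below is a minimal stdlib-only sketch; the `checksum` workload is a hypothetical stand-in for your function under test:

```rust
use std::hint::black_box;
use std::time::Instant;

/// Illustrative workload — any pure function you want to time.
fn checksum(data: &[u64]) -> u64 {
    data.iter().fold(0u64, |acc, x| acc.wrapping_add(x.rotate_left(7)))
}

fn main() {
    let data: Vec<u64> = (0..10_000).collect();

    // Warm-up pass: populate caches, trigger lazy init (problem 2).
    for _ in 0..10 {
        black_box(checksum(black_box(&data)));
    }

    // Take many samples instead of one (problem 3).
    let mut samples: Vec<u128> = (0..100)
        .map(|_| {
            let start = Instant::now();
            // black_box defeats dead-code elimination (problem 1).
            black_box(checksum(black_box(&data)));
            start.elapsed().as_nanos()
        })
        .collect();

    samples.sort_unstable();
    let median = samples[samples.len() / 2];
    println!("median over {} samples: {median} ns", samples.len());
}
```

Even so, this gives no outlier rejection, no confidence intervals, and no cross-run regression comparison; that is exactly what a framework like Criterion adds.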

Criterion.rs — Statistical Benchmarking
Criterion.rs:统计学基准测试

Criterion.rs is the de facto standard for Rust micro-benchmarks. It uses statistical methods to produce reliable measurements and detects performance regressions automatically.
Criterion.rs 基本上就是 Rust 微基准测试的事实标准。它会通过统计方法生成更可靠的测量结果,还能自动识别性能回退。

Setup:
基本配置:

# Cargo.toml
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports", "cargo_bench_support"] }

[[bench]]
name = "parsing_bench"
harness = false  # Use Criterion's harness, not the built-in test harness

A complete benchmark:
一个完整的 benchmark:

// benches/parsing_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};

/// Data type for parsed GPU information
#[derive(Debug, Clone)]
struct GpuInfo {
    index: u32,
    name: String,
    temp_c: u32,
    power_w: f64,
}

/// The function under test — simulate parsing device-query CSV output
fn parse_gpu_csv(input: &str) -> Vec<GpuInfo> {
    input
        .lines()
        .filter(|line| !line.starts_with('#'))
        .filter_map(|line| {
            let fields: Vec<&str> = line.split(", ").collect();
            if fields.len() >= 4 {
                Some(GpuInfo {
                    index: fields[0].parse().ok()?,
                    name: fields[1].to_string(),
                    temp_c: fields[2].parse().ok()?,
                    power_w: fields[3].parse().ok()?,
                })
            } else {
                None
            }
        })
        .collect()
}

fn bench_parse_gpu_csv(c: &mut Criterion) {
    // Representative test data
    let small_input = "0, Acme Accel-V1-80GB, 32, 65.5\n\
                       1, Acme Accel-V1-80GB, 34, 67.2\n";

    let large_input = (0..64)
        .map(|i| format!("{i}, Acme Accel-X1-80GB, {}, {:.1}\n", 30 + i % 20, 60.0 + i as f64))
        .collect::<String>();

    c.bench_function("parse_2_gpus", |b| {
        b.iter(|| parse_gpu_csv(black_box(small_input)))
    });

    c.bench_function("parse_64_gpus", |b| {
        b.iter(|| parse_gpu_csv(black_box(&large_input)))
    });
}

criterion_group!(benches, bench_parse_gpu_csv);
criterion_main!(benches);

Running and reading results:
运行方式和结果解读:

# Run all benchmarks
cargo bench

# Run a specific benchmark by name
cargo bench -- parse_64

# Output:
# parse_2_gpus        time:   [1.2345 µs 1.2456 µs 1.2578 µs]
#                             lower bound / point estimate / upper bound
#                             of the 95% confidence interval
#
# parse_64_gpus       time:   [38.123 µs 38.456 µs 38.812 µs]
#                     change: [-1.2345% -0.5678% +0.1234%] (p = 0.12 > 0.05)
#                     No change in performance detected.

What black_box() does: it is an optimization barrier that prevents dead-code elimination and over-aggressive constant folding. The compiler cannot see through black_box, so it must actually perform the computation. (Since Rust 1.66 the standard library provides std::hint::black_box, which serves the same purpose.)
black_box() 是干什么的:它相当于一道优化屏障,阻止死代码消除和过度的常量折叠。编译器没法看穿 black_box,因此必须老老实实把计算做完。(Rust 1.66 起,标准库也提供了作用相同的 std::hint::black_box。)
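To see the barrier in action, here is a tiny stdlib-only sketch (the function name `sum_to` is illustrative):

```rust
use std::hint::black_box;

/// A trivially foldable computation: the sum of 1..=n.
fn sum_to(n: u64) -> u64 {
    (1..=n).sum()
}

fn main() {
    // Without black_box, the optimizer can evaluate this at compile time
    // and replace the call with the constant 500500, so a timing loop
    // around it would measure an empty body.
    let folded = sum_to(1_000);

    // With black_box, the argument is opaque to the optimizer, so the
    // summation must actually run. This is why benchmark closures wrap
    // their inputs (and often their outputs) in black_box.
    let measured = sum_to(black_box(1_000));

    assert_eq!(folded, measured);
    println!("sum_to(1000) = {measured}");
}
```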

Parameterized Benchmarks and Benchmark Groups
参数化 benchmark 与分组测试

Compare multiple implementations or input sizes:
如果想比较不同实现,或者比较不同输入规模,就可以用参数化 benchmark。

// benches/comparison_bench.rs
use criterion::{criterion_group, criterion_main, Criterion, BenchmarkId, Throughput};

// Assumes generate_gpu_csv, parse_split, parse_regex, and parse_nom are
// defined elsewhere in the crate.

fn bench_parsing_strategies(c: &mut Criterion) {
    let mut group = c.benchmark_group("csv_parsing");

    // Test across different input sizes
    for num_gpus in [1, 8, 32, 64, 128] {
        let input = generate_gpu_csv(num_gpus);

        // Set throughput for bytes-per-second reporting
        group.throughput(Throughput::Bytes(input.len() as u64));

        group.bench_with_input(
            BenchmarkId::new("split_based", num_gpus),
            &input,
            |b, input| b.iter(|| parse_split(input)),
        );

        group.bench_with_input(
            BenchmarkId::new("regex_based", num_gpus),
            &input,
            |b, input| b.iter(|| parse_regex(input)),
        );

        group.bench_with_input(
            BenchmarkId::new("nom_based", num_gpus),
            &input,
            |b, input| b.iter(|| parse_nom(input)),
        );
    }
    group.finish();
}

criterion_group!(benches, bench_parsing_strategies);
criterion_main!(benches);

Output: Criterion generates an HTML report at target/criterion/report/index.html with violin plots, comparison charts, and regression analysis.
输出结果:Criterion 会在 target/criterion/report/index.html 生成 HTML 报告,里面有小提琴图、对比图和回归分析,浏览器里看非常直观。
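The group above assumes a data generator and three parser variants defined elsewhere in the crate. A self-contained sketch of the generator and the dependency-free split_based strategy (parse_regex and parse_nom would build on the regex and nom crates, respectively) might look like this:

```rust
/// Generate n CSV rows in the accelerator-query format used above.
fn generate_gpu_csv(n: usize) -> String {
    (0..n)
        .map(|i| format!("{i}, Acme Accel-X1-80GB, {}, {:.1}\n", 30 + i % 20, 60.0 + i as f64))
        .collect()
}

/// "split_based" strategy: plain str::split, no extra dependencies.
/// Returns (index, name, temp_c, power_w) tuples.
fn parse_split(input: &str) -> Vec<(u32, String, u32, f64)> {
    input
        .lines()
        .filter_map(|line| {
            let f: Vec<&str> = line.split(", ").collect();
            if f.len() >= 4 {
                Some((f[0].parse().ok()?, f[1].to_string(), f[2].parse().ok()?, f[3].parse().ok()?))
            } else {
                None
            }
        })
        .collect()
}

fn main() {
    let csv = generate_gpu_csv(8);
    let rows = parse_split(&csv);
    assert_eq!(rows.len(), 8);
    println!("parsed {} rows", rows.len());
}
```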

Divan — A Lighter Alternative
Divan:更轻量的替代方案

Divan is a newer benchmarking framework that uses attribute macros instead of Criterion’s macro DSL:
Divan 是一个更新、更轻的 benchmark 框架,它主要靠 attribute macro,而不是 Criterion 那一套宏 DSL。

# Cargo.toml
[dev-dependencies]
divan = "0.1"

[[bench]]
name = "parsing_bench"
harness = false

// benches/parsing_bench.rs
use divan::black_box;

// GpuInfo and parse_gpu_csv are the same type and function as in the
// Criterion example above.

const SMALL_INPUT: &str = "0, Acme Accel-V1-80GB, 32, 65.5\n\
                          1, Acme Accel-V1-80GB, 34, 67.2\n";

fn generate_gpu_csv(n: usize) -> String {
    (0..n)
        .map(|i| format!("{i}, Acme Accel-X1-80GB, {}, {:.1}\n", 30 + i % 20, 60.0 + i as f64))
        .collect()
}

fn main() {
    divan::main();
}

#[divan::bench]
fn parse_2_gpus() -> Vec<GpuInfo> {
    parse_gpu_csv(black_box(SMALL_INPUT))
}

#[divan::bench(args = [1, 8, 32, 64, 128])]
fn parse_n_gpus(n: usize) -> Vec<GpuInfo> {
    // Note: input generation runs inside the timed body here; to exclude
    // setup cost, use divan::Bencher::with_inputs instead.
    let input = generate_gpu_csv(n);
    parse_gpu_csv(black_box(&input))
}

// Divan output is a clean table:
// ╰─ parse_2_gpus   fastest  │ slowest  │ median   │ mean     │ samples │ iters
//                   1.234 µs │ 1.567 µs │ 1.345 µs │ 1.350 µs │ 100     │ 1600

When to choose Divan over Criterion:
什么时候选 Divan:

  • Simpler API (attribute macros, less boilerplate)
    API 更简单,样板代码更少。
  • Faster compilation (fewer dependencies)
    依赖更少,编译更快。
  • Good for quick perf checks during development
    适合开发过程里的快速性能检查。

When to choose Criterion:
什么时候选 Criterion:

  • Statistical regression detection across runs
    需要跨运行做统计学回归分析。
  • HTML reports with charts
    需要图表化 HTML 报告。
  • Established ecosystem, more CI integrations
    生态更成熟,CI 集成也更多。

Profiling with perf and Flamegraphs
perf 和火焰图做性能剖析

Benchmarks tell you how fast — profiling tells you where the time goes.
benchmark 告诉你的是“有多快”,profiling 告诉你的是“时间到底花在哪”。

# Step 1: Build with debug info (release speed, debug symbols)
cargo build --release
# Ensure debug info is available:
# [profile.release]
# debug = true          # Add this temporarily for profiling

# Step 2: Record with perf
perf record --call-graph=dwarf ./target/release/diag_tool --run-diagnostics

# Step 3: Generate a flamegraph
# Install: cargo install flamegraph
# Optional: cargo install addr2line --features=bin  (speeds up symbol resolution)
cargo flamegraph --root -- --run-diagnostics
# Opens an interactive SVG flamegraph

# Alternative: use perf + inferno
perf script | inferno-collapse-perf | inferno-flamegraph > flamegraph.svg

Reading a flamegraph:
火焰图怎么看:

  • Width = time spent in that function
    宽度越大,说明函数耗时越多。
  • Height = call stack depth
    高度表示调用栈深度,本身不等于更慢。
  • Bottom = entry point, Top = leaf functions doing actual work
    底部是入口,顶部通常是真正干活的叶子函数。
  • Look for wide plateaus at the top — those are your hot spots
    盯着顶部那些又宽又平的块看,热点大概率就在那里。

Profile-guided optimization (PGO):
基于 profile 的优化(PGO):

# Step 1: Build with instrumentation
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release

# Step 2: Run representative workloads
./target/release/diag_tool --run-full   # generates profiling data

# Step 3: Merge profiling data
# Use the llvm-profdata that matches rustc's LLVM version:
# $(rustc --print sysroot)/lib/rustlib/x86_64-unknown-linux-gnu/bin/llvm-profdata
# Or if llvm-tools is installed: rustup component add llvm-tools
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data/

# Step 4: Rebuild with profiling feedback
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release
# Typical improvement: 5-20% for compute-bound code (parsing, crypto, codegen).
# I/O-bound or syscall-heavy code will see much less benefit.

Tip: Before spending time on PGO, ensure your release profile already has LTO enabled — it typically delivers a bigger win for less effort.
建议:在 PGO 上头之前,先确认 release profile 里的 LTO 已经开起来了。很多时候 LTO 的收益更大,成本还更低。
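For reference, a minimal sketch of such a profile (values are illustrative; measure before committing to them):

```toml
# Cargo.toml
[profile.release]
lto = "thin"       # cheap cross-crate inlining; `lto = true` (fat LTO) optimizes harder
codegen-units = 1  # single codegen unit: better optimization, slower compile
```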

hyperfine — Quick End-to-End Timing
hyperfine:快速端到端计时

hyperfine benchmarks whole commands rather than individual functions. It is perfect for measuring overall binary performance:
hyperfine 测的是整条命令,而不是单个函数。所以它特别适合看二进制整体执行性能。

# Install
cargo install hyperfine
# Or: sudo apt install hyperfine  (Ubuntu 23.04+)

# Basic benchmark
hyperfine './target/release/diag_tool --run-diagnostics'

# Compare two implementations
hyperfine './target/release/diag_tool_v1 --run-diagnostics' \
          './target/release/diag_tool_v2 --run-diagnostics'

# Warm-up runs + minimum iterations
hyperfine --warmup 3 --min-runs 10 './target/release/diag_tool --run-all'

# Export results as JSON for CI comparison
hyperfine --export-json bench.json './target/release/diag_tool --run-all'

When to use hyperfine vs Criterion:
hyperfine 和 Criterion 各自适合什么:

  • hyperfine: whole-binary timing, before/after refactor comparisons, I/O-heavy workloads
    hyperfine:测整机耗时,适合重构前后对比、也适合 IO 偏重的任务。
  • Criterion: individual functions, micro-benchmarks, statistical regression detection
    Criterion:测单函数和微基准,更适合做统计学回归检测。

Continuous Benchmarking in CI
在 CI 里持续跑 benchmark

Detect performance regressions before they ship:
把性能回退挡在发版之前。

# .github/workflows/bench.yml
name: Benchmarks

on:
  pull_request:
    paths: ['**/*.rs', 'Cargo.toml', 'Cargo.lock']

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: dtolnay/rust-toolchain@stable

      - name: Run benchmarks
        # Requires criterion = { features = ["cargo_bench_support"] } for --output-format
        run: cargo bench -- --output-format bencher | tee bench_output.txt

      - name: Store benchmark result
        uses: benchmark-action/github-action-benchmark@v1
        with:
          tool: 'cargo'
          output-file-path: bench_output.txt
          github-token: ${{ secrets.GITHUB_TOKEN }}
          auto-push: true
          alert-threshold: '120%'    # Alert if 20% slower
          comment-on-alert: true
          fail-on-alert: true        # Block PR if regression detected

Key CI considerations:
CI 里跑 benchmark 要注意:

  • Use dedicated benchmark runners for consistent results
    最好用专门的 runner,否则噪音很大。
  • Pin the runner to a specific machine type if using cloud CI
    云上 CI 尽量锁定机型。
  • Store historical data to detect gradual regressions
    保存历史数据,方便发现缓慢恶化。
  • Set thresholds based on workload tolerance
    阈值别瞎定,得按业务容忍度来。

Application: Parsing Performance
应用场景:解析性能

The project has several performance-sensitive parsing paths that would benefit from benchmarks:
当前工程里有几条对性能很敏感的解析路径,很适合优先补 benchmark。

  • accelerator-query CSV/XML output (crate: device_diag) — called per-GPU, up to 8× per run
    accelerator-query 的 CSV/XML 输出(device_diag):每张 GPU 都要调用,单次运行最多重复 8 次。
  • Sensor event parsing (event_log) — thousands of records on busy servers
    传感器事件解析(event_log):繁忙服务器上动辄上千条记录。
  • PCIe topology JSON (topology_lib) — complex nested structures, golden-file validated
    PCIe 拓扑 JSON(topology_lib):结构复杂、嵌套深,且已有 golden file 校验数据。
  • Report JSON serialization (diag_framework) — final report output, size-sensitive
    报告 JSON 序列化(diag_framework):最终报告输出,对体积和耗时都敏感。
  • Config JSON loading (config_loader) — startup latency
    配置 JSON 加载(config_loader):直接影响启动延迟。

Recommended first benchmark — the topology parser, which already has golden-file test data:
最推荐先做的 benchmark 是拓扑解析器,因为它已经有现成的 golden file 测试数据。

// topology_lib/benches/parse_bench.rs (proposed)
use criterion::{criterion_group, criterion_main, Criterion, Throughput};
use std::fs;

fn bench_topology_parse(c: &mut Criterion) {
    let mut group = c.benchmark_group("topology_parse");

    for golden_file in ["S2001", "S1015", "S1035", "S1080"] {
        let path = format!("tests/test_data/{golden_file}.json");
        let data = fs::read_to_string(&path).expect("golden file not found");
        group.throughput(Throughput::Bytes(data.len() as u64));

        group.bench_function(golden_file, |b| {
            b.iter(|| {
                topology_lib::TopologyProfile::from_json_str(
                    criterion::black_box(&data)
                )
            });
        });
    }
    group.finish();
}

criterion_group!(benches, bench_topology_parse);
criterion_main!(benches);

Try It Yourself
动手试一试

  1. Write a Criterion benchmark: Pick any parsing function in your codebase. Create a benches/ directory, set up a Criterion benchmark that measures throughput in bytes/second. Run cargo bench and examine the HTML report.
    写一个 Criterion benchmark:在代码库里随便挑一个解析函数,新建 benches/ 目录,做一个能统计 bytes/s 吞吐的 benchmark,跑 cargo bench,再打开 HTML 报告看看。

  2. Generate a flamegraph: Build your project with debug = true in [profile.release], then run cargo flamegraph -- <your-args>. Identify the three widest stacks at the top of the flamegraph.
    生成一张火焰图:在 [profile.release] 里临时加上 debug = true,然后运行 cargo flamegraph -- <参数>,找出顶部最宽的三个调用栈。

  3. Compare with hyperfine: Install hyperfine and benchmark the overall execution time of your binary with different flags. Compare it to the per-function times from Criterion. Where does the time go that Criterion doesn’t see?
    再和 hyperfine 对比:安装 hyperfine,分别测不同参数下的整机耗时,再和 Criterion 的函数级耗时对照。注意那些 Criterion 看不到、但整机时间里确实存在的部分,例如 IO、系统调用和进程启动。

Benchmark Tool Selection
基准测试工具选择

flowchart TD
    START["Want to measure performance?<br/>想测性能吗?"] --> WHAT{"What level?<br/>测哪个层次?"}
    
    WHAT -->|"Single function<br/>单个函数"| CRITERION["Criterion.rs<br/>Statistical, regression detection<br/>统计分析 + 回归检测"]
    WHAT -->|"Quick function check<br/>快速函数检查"| DIVAN["Divan<br/>Lighter, attribute macros<br/>更轻量"]
    WHAT -->|"Whole binary<br/>整个二进制"| HYPERFINE["hyperfine<br/>End-to-end, wall-clock<br/>端到端计时"]
    WHAT -->|"Find hot spots<br/>找热点"| PERF["perf + flamegraph<br/>CPU sampling profiler<br/>采样剖析"]
    
    CRITERION --> CI_BENCH["Continuous benchmarking<br/>in GitHub Actions<br/>持续基准测试"]
    PERF --> OPTIMIZE["Profile-Guided<br/>Optimization (PGO)<br/>PGO 优化"]
    
    style CRITERION fill:#91e5a3,color:#000
    style DIVAN fill:#91e5a3,color:#000
    style HYPERFINE fill:#e3f2fd,color:#000
    style PERF fill:#ffd43b,color:#000
    style CI_BENCH fill:#e3f2fd,color:#000
    style OPTIMIZE fill:#ffd43b,color:#000

🏋️ Exercises
🏋️ 练习

🟢 Exercise 1: First Criterion Benchmark
🟢 练习 1:第一份 Criterion benchmark

Create a crate with a function that sorts a Vec<u64> of 10,000 random elements. Write a Criterion benchmark for .sort(), add a second one for .sort_unstable(), and compare the two in the HTML report.
创建一个 crate,写一个函数去排序 10,000 个随机 u64。先给 .sort() 做一个 Criterion benchmark,再给 .sort_unstable() 加一个,在 HTML 报告里对比两者的性能差异。

Solution 参考答案

# Cargo.toml
[[bench]]
name = "sort_bench"
harness = false

[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }
rand = "0.8"

// benches/sort_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rand::Rng;

fn generate_data(n: usize) -> Vec<u64> {
    let mut rng = rand::thread_rng();
    (0..n).map(|_| rng.gen()).collect()
}

fn bench_sort(c: &mut Criterion) {
    let mut group = c.benchmark_group("sort-10k");

    group.bench_function("stable", |b| {
        b.iter_batched(
            || generate_data(10_000),
            |mut data| { data.sort(); black_box(&data); },
            criterion::BatchSize::SmallInput,
        )
    });

    group.bench_function("unstable", |b| {
        b.iter_batched(
            || generate_data(10_000),
            |mut data| { data.sort_unstable(); black_box(&data); },
            criterion::BatchSize::SmallInput,
        )
    });

    group.finish();
}

criterion_group!(benches, bench_sort);
criterion_main!(benches);

# Run and open the report
cargo bench
open target/criterion/sort-10k/report/index.html

🟡 Exercise 2: Flamegraph Hot Spot
🟡 练习 2:火焰图热点分析

Build a project with debug = true in [profile.release], then generate a flamegraph. Identify the top 3 widest stacks.
[profile.release] 里加 debug = true,重新构建项目并生成火焰图,再找出最宽的三个调用栈。

Solution 参考答案

# Cargo.toml
[profile.release]
debug = true  # Keep symbols for flamegraph

# Build and profile (cargo flamegraph uses the release profile by default)
cargo install flamegraph
cargo flamegraph -- <your-args>
# Writes flamegraph.svg; the widest stacks at the top are your hot spots

Key Takeaways
本章要点

  • Never benchmark with Instant::now() — use Criterion.rs for statistical rigor and regression detection
    别再拿 Instant::now() 当正式 benchmark 了,Criterion 才能提供更像样的统计结果和回归检测。
  • black_box() prevents the compiler from optimizing away your benchmark target
    black_box() 的任务就是防止编译器把被测逻辑直接优化掉。
  • hyperfine measures wall-clock time for the whole binary; Criterion measures individual functions — use both
    hyperfine 测整机耗时,Criterion 测函数级性能,两者最好配合使用。
  • Flamegraphs show where time is spent; benchmarks show how much time is spent
    火焰图负责告诉位置,benchmark 负责告诉量级。
  • Continuous benchmarking in CI catches performance regressions before they ship
    把 benchmark 放进 CI,很多性能回退在合入前就能被逮住。