# 11. Serialization, Zero-Copy, and Binary Data 🟡

What you'll learn:

serde fundamentals: derive macros, attributes, and enum representations

Zero-copy deserialization for high-performance, read-heavy workloads

The serde format ecosystem (JSON, TOML, bincode, MessagePack)

Binary data handling with repr(C), zerocopy, and bytes::Bytes

serde Fundamentals

serde (SERialize/DEserialize) is the de facto standard serialization framework for Rust. It separates the data model (your types) from the data format (JSON, TOML, and so on):

use serde::{Serialize, Deserialize};

#[derive(Debug, Serialize, Deserialize)]
struct ServerConfig {
    name: String,
    port: u16,
    #[serde(default)]
    max_connections: usize,
    #[serde(skip_serializing_if = "Option::is_none")]
    tls_cert_path: Option<String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let json_input = r#"{
        "name": "hw-diag",
        "port": 8080
    }"#;
    let config: ServerConfig = serde_json::from_str(json_input)?;
    println!("{config:?}");

    let output = serde_json::to_string_pretty(&config)?;
    println!("{output}");

    let toml_input = r#"
        name = "hw-diag"
        port = 8080
    "#;
    let config: ServerConfig = toml::from_str(toml_input)?;
    println!("{config:?}");

    Ok(())
}

Key insight: Derive Serialize and Deserialize once, and the same struct works with every format crate built on serde's traits.
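The separation between data model and format can be pictured with a toy model. The sketch below is not serde's API: `ToyFormat`, `ToySerialize`, `JsonLike`, and `TomlLike` are hypothetical names invented for illustration, and the output skips string escaping. It only shows the shape of the idea: the type describes its fields once, and each backend decides the syntax.

```rust
// Toy illustration only: these traits are hypothetical and are NOT serde's API.

/// A format backend: knows how to write field names and primitive values.
trait ToyFormat {
    fn begin(&mut self);
    fn field_str(&mut self, key: &str, value: &str);
    fn field_u64(&mut self, key: &str, value: u64);
    fn end(&mut self) -> String;
}

/// A data type: describes itself to any backend without knowing the syntax.
trait ToySerialize {
    fn serialize(&self, fmt: &mut dyn ToyFormat) -> String;
}

struct ServerConfig {
    name: String,
    port: u16,
}

impl ToySerialize for ServerConfig {
    fn serialize(&self, fmt: &mut dyn ToyFormat) -> String {
        fmt.begin();
        fmt.field_str("name", &self.name);
        fmt.field_u64("port", u64::from(self.port));
        fmt.end()
    }
}

/// JSON-flavoured backend (no escaping; illustration only).
#[derive(Default)]
struct JsonLike {
    parts: Vec<String>,
}

impl ToyFormat for JsonLike {
    fn begin(&mut self) { self.parts.clear(); }
    fn field_str(&mut self, key: &str, value: &str) {
        self.parts.push(format!("\"{key}\":\"{value}\""));
    }
    fn field_u64(&mut self, key: &str, value: u64) {
        self.parts.push(format!("\"{key}\":{value}"));
    }
    fn end(&mut self) -> String { format!("{{{}}}", self.parts.join(",")) }
}

/// TOML-flavoured backend (no escaping; illustration only).
#[derive(Default)]
struct TomlLike {
    lines: Vec<String>,
}

impl ToyFormat for TomlLike {
    fn begin(&mut self) { self.lines.clear(); }
    fn field_str(&mut self, key: &str, value: &str) {
        self.lines.push(format!("{key} = \"{value}\""));
    }
    fn field_u64(&mut self, key: &str, value: u64) {
        self.lines.push(format!("{key} = {value}"));
    }
    fn end(&mut self) -> String { self.lines.join("\n") }
}
```

Calling `cfg.serialize(&mut JsonLike::default())` and `cfg.serialize(&mut TomlLike::default())` on the same value yields two different encodings without touching `ServerConfig`. serde's real traits are considerably richer, but the division of labour is similar.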

Common serde Attributes

serde offers fine-grained control through container-level and field-level attributes:

use serde::{Serialize, Deserialize};

#[derive(Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
#[serde(deny_unknown_fields)]
struct DiagResult {
    test_name: String,
    pass_count: u32,
    fail_count: u32,
}

#[derive(Serialize, Deserialize)]
struct Sensor {
    #[serde(rename = "sensor_id")]
    id: u64,

    #[serde(default)]
    enabled: bool,

    #[serde(default = "default_threshold")]
    threshold: f64,

    #[serde(skip)]
    cached_value: Option<f64>,

    #[serde(skip_serializing_if = "Vec::is_empty")]
    tags: Vec<String>,

    #[serde(flatten)]
    metadata: Metadata,

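    // `hex_bytes` must be a module in scope that provides `serialize` and
    // `deserialize` functions; it is not shown in this listing.
    #[serde(with = "hex_bytes")]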
    raw_data: Vec<u8>,
}

fn default_threshold() -> f64 { 1.0 }

#[derive(Serialize, Deserialize)]
struct Metadata {
    vendor: String,
    model: String,
}
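The `with = "hex_bytes"` attribute above refers to a module that the listing does not show. Such a module would expose serde `serialize` and `deserialize` functions; the conversion logic they would wrap might look like the std-only sketch below. The helper names `encode_hex`, `decode_hex`, and `hex_val` are assumptions made for illustration, not part of serde or of the original module.

```rust
/// Encodes bytes as lowercase hex, e.g. [0xde, 0xad] -> "dead".
fn encode_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// Maps one ASCII hex digit to its numeric value.
fn hex_val(c: u8) -> Option<u8> {
    match c {
        b'0'..=b'9' => Some(c - b'0'),
        b'a'..=b'f' => Some(c - b'a' + 10),
        b'A'..=b'F' => Some(c - b'A' + 10),
        _ => None,
    }
}

/// Decodes a hex string; rejects odd lengths and non-hex characters.
fn decode_hex(s: &str) -> Result<Vec<u8>, String> {
    let bytes = s.as_bytes();
    if bytes.len() % 2 != 0 {
        return Err(format!("odd-length hex string: {} bytes", bytes.len()));
    }
    bytes
        .chunks(2)
        .map(|pair| match (hex_val(pair[0]), hex_val(pair[1])) {
            (Some(hi), Some(lo)) => Ok(hi << 4 | lo),
            _ => Err(format!(
                "invalid hex digits: {:?}",
                String::from_utf8_lossy(pair)
            )),
        })
        .collect()
}
```

A `hex_bytes::serialize` function would then pass `encode_hex(data)` to the serializer as a string, and `hex_bytes::deserialize` would read a string and map a `decode_hex` error into the deserializer's error type.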

Most-used attributes cheat sheet:

| Attribute | Level | Effect |
|---|---|---|
| `rename_all = "camelCase"` | Container | Rename all fields to a target naming convention |
| `deny_unknown_fields` | Container | Error on unexpected keys |
| `default` | Field | Use `Default::default()` when the field is missing |
| `rename = "..."` | Field | Custom serialized name |
| `skip` | Field | Exclude from both serialization and deserialization |
| `skip_serializing_if = "fn"` | Field | Conditionally skip on serialize |
| `flatten` | Field | Inline the fields of a nested struct |
| `with = "module"` | Field | Use a custom ser/de module |
| `alias = "..."` | Field | Accept alternative names when deserializing |
| `untagged` | Enum | Match enum variants by shape |

Enum Representations

serde supports four enum representations, shown here as they appear in formats like JSON:

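use serde::{Serialize, Deserialize};

// Externally tagged (the default):
// "Reboot", {"RunDiag": {"test_name": "mem", "timeout_secs": 30}}, {"SetFanSpeed": 80}
#[derive(Serialize, Deserialize)]
enum Command {
    Reboot,
    RunDiag { test_name: String, timeout_secs: u64 },
    SetFanSpeed(u8),
}

// Internally tagged: {"type": "Start", "timestamp": 1700000000}
#[derive(Serialize, Deserialize)]
#[serde(tag = "type")]
enum Event {
    Start { timestamp: u64 },
    Error { code: i32, message: String },
    End   { timestamp: u64, success: bool },
}

// Adjacently tagged: {"t": "Text", "c": "hello"}
#[derive(Serialize, Deserialize)]
#[serde(tag = "t", content = "c")]
enum Payload {
    Text(String),
    Binary(Vec<u8>),
}

// Untagged: "hello" or 42.0; variants are tried in declaration order
#[derive(Serialize, Deserialize)]
#[serde(untagged)]
enum StringOrNumber {
    Str(String),
    Num(f64),
}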

Which representation to choose: Internally tagged enums are usually the best default for JSON APIs. untagged is flexible, but serde tries the variants in declaration order and keeps the first one that deserializes, so overlapping shapes become ambiguous quickly.
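To see why declaration order matters for first-match strategies, consider this hand-written stand-in. It is not serde code: the enums and the `parse_int_first` / `parse_float_first` functions are hypothetical, and they use `str::parse` to mimic "try each variant in order, keep the first success".

```rust
#[derive(Debug, PartialEq)]
enum IntFirst {
    Int(i64),
    Float(f64),
}

#[derive(Debug, PartialEq)]
enum FloatFirst {
    Float(f64),
    Int(i64),
}

/// Tries `Int` before `Float`, mirroring the declaration order of `IntFirst`.
fn parse_int_first(s: &str) -> Option<IntFirst> {
    if let Ok(i) = s.parse::<i64>() {
        return Some(IntFirst::Int(i));
    }
    s.parse::<f64>().ok().map(IntFirst::Float)
}

/// Tries `Float` before `Int`. Every integer literal also parses as f64,
/// so the `Int` fallback is effectively unreachable for numeric input.
fn parse_float_first(s: &str) -> Option<FloatFirst> {
    if let Ok(f) = s.parse::<f64>() {
        return Some(FloatFirst::Float(f));
    }
    s.parse::<i64>().ok().map(FloatFirst::Int)
}
```

With the input `"42"`, `parse_int_first` yields `Int(42)` while `parse_float_first` yields `Float(42.0)`. The same kind of shadowing can occur in an untagged enum when an earlier variant accepts everything a later one does.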

Zero-Copy Deserialization

serde can deserialize borrowed data directly from the input buffer, avoiding extra string allocations:

use serde::Deserialize;

#[derive(Deserialize)]
struct OwnedRecord {
    name: String,
    value: String,
}

#[derive(Deserialize)]
struct BorrowedRecord<'a> {
    name: &'a str,
    value: &'a str,
}

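fn main() {
    let input = r#"{"name": "cpu_temp", "value": "72.5"}"#;

    // Allocates a new String for every field.
    let owned: OwnedRecord = serde_json::from_str(input).unwrap();
    // Borrows both fields directly from `input`; no string allocations.
    // Caveat: a `&str` field fails to deserialize if the JSON string
    // contains escape sequences, because the unescaped text is not in `input`.
    let borrowed: BorrowedRecord = serde_json::from_str(input).unwrap();

    println!("{}: {}", owned.name, owned.value);
    println!("{}: {}", borrowed.name, borrowed.value);
}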
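The borrowing idea does not depend on serde. As a minimal std-only sketch, the hypothetical `parse_line` below splits a `key = value` line into a struct whose fields are slices of the input, and the hypothetical helper `is_subslice` checks that no copy took place.

```rust
/// Borrowed view of one "key = value" line; both fields point into the input.
#[derive(Debug, PartialEq)]
struct KeyValue<'a> {
    key: &'a str,
    value: &'a str,
}

/// Splits at the first '=' and trims whitespace. No bytes are copied.
fn parse_line(line: &str) -> Option<KeyValue<'_>> {
    let (key, value) = line.split_once('=')?;
    Some(KeyValue { key: key.trim(), value: value.trim() })
}

/// Returns true if `part` lies entirely inside the memory of `whole`.
fn is_subslice(whole: &str, part: &str) -> bool {
    let start = whole.as_ptr() as usize;
    let p = part.as_ptr() as usize;
    p >= start && p + part.len() <= start + whole.len()
}
```

Because `KeyValue<'a>` borrows from the input, the compiler refuses to let it outlive the buffer, which is the same constraint that `BorrowedRecord<'a>` carries.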

When to use zero-copy:

Parsing large files where only part of the data is used
High-throughput pipelines such as packets or log streams
The input buffer is guaranteed to live long enough

When not to use zero-copy:

Input buffers are short-lived or will be reused immediately
Results need to outlive the source buffer
Fields need transformation, unescaping, or normalization

Practical tip: Cow<'a, str> is often the sweet spot: borrow when possible, allocate when necessary. With serde, annotate the field with #[serde(borrow)]; without it, a Cow field is always deserialized as owned.
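The borrow-or-allocate pattern is easy to try outside serde. In this minimal sketch, the hypothetical function `normalize_name` returns the input unchanged when it is already lowercase and allocates only when a conversion is needed.

```rust
use std::borrow::Cow;

/// Lowercases a sensor name, allocating only when the input
/// actually contains an uppercase character.
fn normalize_name(name: &str) -> Cow<'_, str> {
    if name.chars().any(|c| c.is_uppercase()) {
        Cow::Owned(name.to_lowercase())
    } else {
        Cow::Borrowed(name)
    }
}
```

Callers can treat the result as a `&str` through deref and call `into_owned()` only if they need to keep it beyond the input's lifetime.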

The Format Ecosystem

| Format | Crate | Human-Readable | Size | Speed | Use Case |
|---|---|---|---|---|---|
| JSON | `serde_json` | ✅ | Large | Good | Config, REST, logging |
| TOML | `toml` | ✅ | Medium | Good | Config files |
| YAML | `serde_yaml` | ✅ | Medium | Good | Nested config |
| bincode | `bincode` | ❌ | Small | Fast | Rust-to-Rust IPC, cache |
| postcard | `postcard` | ❌ | Tiny | Very fast | Embedded, `no_std` |
| MessagePack | `rmp-serde` | ❌ | Small | Fast | Cross-language binary protocol |
| CBOR | `ciborium` | ❌ | Small | Fast | IoT, constrained systems |

// One derived type can be handed to any of the crates in the table above.
#[derive(serde::Serialize, serde::Deserialize, Debug)]
struct DiagConfig {
    name: String,
    tests: Vec<String>,
    timeout_secs: u64,
}

Choose your format: Human-edited config usually wants TOML or JSON. Rust-to-Rust binary traffic suits bincode. Cross-language binary protocols often prefer MessagePack or CBOR. Embedded systems lean toward postcard.
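The "Size" column is easier to judge with numbers. The sketch below compares a comma-separated text encoding with a hand-rolled binary one (a `u32` count followed by fixed-width little-endian values). This layout is an assumption made for illustration and is not the wire format of bincode, postcard, or any other crate in the table; the function names are hypothetical.

```rust
/// Text encoding: decimal digits separated by commas.
fn encode_text(values: &[u32]) -> Vec<u8> {
    values
        .iter()
        .map(|v| v.to_string())
        .collect::<Vec<_>>()
        .join(",")
        .into_bytes()
}

/// Binary encoding: u32 element count, then each value as 4 little-endian bytes.
fn encode_binary(values: &[u32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + values.len() * 4);
    out.extend_from_slice(&(values.len() as u32).to_le_bytes());
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Inverse of `encode_binary`; returns None on truncated or oversized input.
fn decode_binary(bytes: &[u8]) -> Option<Vec<u32>> {
    if bytes.len() < 4 {
        return None;
    }
    let count = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as usize;
    let body = &bytes[4..];
    if body.len() != count.checked_mul(4)? {
        return None;
    }
    Some(
        body.chunks_exact(4)
            .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}
```

For three ten-digit values the text form takes 32 bytes and the binary form 16. For `[1, 2, 3]` the text form is only 5 bytes, which is why compact formats often use variable-length integers instead of fixed-width ones.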

Binary Data and `repr(C)`

Low-level diagnostics often deal with binary protocols and hardware register layouts. Rust provides a few important tools for that job:

// repr(C): fields are laid out in declaration order, as a C compiler would.
#[repr(C)]
#[derive(Debug, Clone, Copy)]
struct IpmiHeader {
    rs_addr: u8,
    net_fn_lun: u8,
    checksum: u8,
    rq_addr: u8,
    rq_seq_lun: u8,
    cmd: u8,
}

impl IpmiHeader {
    /// Parses a header from the first six bytes of `data`.
    /// Returns `None` if `data` is too short.
    fn from_bytes(data: &[u8]) -> Option<Self> {
        if data.len() < std::mem::size_of::<Self>() {
            return None;
        }
        Some(IpmiHeader {
            rs_addr:     data[0],
            net_fn_lun:  data[1],
            checksum:    data[2],
            rq_addr:     data[3],
            rq_seq_lun:  data[4],
            cmd:         data[5],
        })
    }
}

// repr(C, packed): no padding between fields; the struct's alignment drops to 1.
#[repr(C, packed)]
#[derive(Debug, Clone, Copy)]
struct PcieCapabilityHeader {
    cap_id: u8,
    next_cap: u8,
    cap_reg: u16,
}

repr(C) gives a predictable, C-compatible layout: fields appear in declaration order, with padding inserted for alignment. repr(C, packed) removes that padding, which lowers field alignment to 1. The compiler rejects references to fields that may be misaligned, so read such fields by value (copy them out) instead of borrowing them.
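The effect of padding can be checked with `size_of` and `align_of`. In the sketch below, `Padded`, `Packed`, `parse_pcie_header`, and `cap_reg_of` are hypothetical names added for illustration; the sizes in the comments assume a target where `u32` has 4-byte alignment, which holds on common platforms. The parser decodes the 16-bit field as little-endian, which is the byte order of PCI configuration space.

```rust
use std::mem::{align_of, size_of};

/// repr(C): 3 padding bytes follow `tag` so that `value` is 4-byte aligned.
#[repr(C)]
#[derive(Clone, Copy)]
struct Padded {
    tag: u8,
    value: u32,
}

/// repr(C, packed): no padding, so `value` sits at offset 1.
#[repr(C, packed)]
#[derive(Clone, Copy)]
struct Packed {
    tag: u8,
    value: u32,
}

#[repr(C, packed)]
#[derive(Clone, Copy)]
struct PcieCapabilityHeader {
    cap_id: u8,
    next_cap: u8,
    cap_reg: u16,
}

/// Field-by-field parsing with explicit byte order; no unsafe code needed.
fn parse_pcie_header(data: &[u8]) -> Option<PcieCapabilityHeader> {
    if data.len() < size_of::<PcieCapabilityHeader>() {
        return None;
    }
    Some(PcieCapabilityHeader {
        cap_id: data[0],
        next_cap: data[1],
        cap_reg: u16::from_le_bytes([data[2], data[3]]),
    })
}

/// Copies the field out by value. Writing `&h.cap_reg` instead would not
/// compile, because the field may be misaligned inside a packed struct.
fn cap_reg_of(h: &PcieCapabilityHeader) -> u16 {
    h.cap_reg
}
```

On such a target `size_of::<Padded>()` is 8 and `size_of::<Packed>()` is 5, even though both structs declare the same fields.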

`zerocopy` and `bytemuck`: Safe Transmutation Helpers

Instead of relying on raw unsafe transmute, these crates use derive macros to check layout requirements (such as the absence of padding bytes) at compile time:

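use bytemuck::{Pod, Zeroable};
use zerocopy::{FromBytes, Immutable, IntoBytes, KnownLayout};

// zerocopy: the derives fail to compile if the struct has padding bytes or
// contains a field that cannot be built from arbitrary bytes.
#[derive(FromBytes, IntoBytes, KnownLayout, Immutable, Debug)]
#[repr(C)]
struct SensorReading {
    sensor_id: u16,
    flags: u8,
    _reserved: u8, // explicit byte so `value` sits at offset 4 with no implicit padding
    value: u32,
}

// bytemuck: `Pod` ("plain old data") requires `Copy`, no padding,
// and fields for which every bit pattern is valid.
#[derive(Pod, Zeroable, Clone, Copy, Debug)]
#[repr(C)]
struct GpuRegister {
    address: u32,
    value: u32,
}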

| Approach | Safety | Overhead | Use When |
|---|---|---|---|
| Manual parsing | ✅ | Copies fields | Small structs, irregular layouts |
| `zerocopy` | ✅ | Zero-copy | Big buffers, strict layout checks |
| `bytemuck` | ✅ | Zero-copy | Simple `Pod` types |
| `unsafe transmute` | ❌ | Zero-copy | Last resort only |
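To make the table concrete, here is a rough sketch of the runtime checks a zero-copy cast needs when written by hand. It is not how `zerocopy` or `bytemuck` are implemented; `view_as_u32s` and `u32s_as_bytes` are hypothetical helpers that show the alignment and length conditions those crates verify on your behalf.

```rust
use std::mem::{align_of, size_of};

/// Views `bytes` as `&[u32]` without copying. Returns `None` when the buffer
/// is misaligned for u32 or its length is not a multiple of 4.
/// The values are interpreted in the machine's native byte order.
fn view_as_u32s(bytes: &[u8]) -> Option<&[u32]> {
    let misaligned = (bytes.as_ptr() as usize) % align_of::<u32>() != 0;
    let ragged = bytes.len() % size_of::<u32>() != 0;
    if misaligned || ragged {
        return None;
    }
    // SAFETY: the pointer is non-null and aligned for u32, the element count
    // covers exactly the borrowed bytes, every bit pattern is a valid u32,
    // and the returned slice shares the lifetime of `bytes`.
    Some(unsafe {
        std::slice::from_raw_parts(
            bytes.as_ptr().cast::<u32>(),
            bytes.len() / size_of::<u32>(),
        )
    })
}

/// The reverse direction always succeeds: u8 has alignment 1 and
/// u32 contains no padding bytes.
fn u32s_as_bytes(words: &[u32]) -> &[u8] {
    // SAFETY: same allocation, same lifetime, length in bytes is exact.
    unsafe {
        std::slice::from_raw_parts(
            words.as_ptr().cast::<u8>(),
            words.len() * size_of::<u32>(),
        )
    }
}
```

Every `unsafe` block here carries an argument that a reviewer has to re-check by hand. The derive-based crates move that argument into the type system, which is the main reason to prefer them.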

`bytes::Bytes`: Reference-Counted Buffers

The bytes crate is popular in async and network stacks because it supports cheap cloning and zero-copy slicing:

use bytes::{Bytes, BytesMut, Buf, BufMut};

fn main() {
    let mut buf = BytesMut::with_capacity(1024);
    buf.put_u8(0x01);
    buf.put_u16(0x1234);
    buf.put_slice(b"hello");

    let data: Bytes = buf.freeze();
    let data2 = data.clone();   // cheap clone
    let slice = data.slice(3..8); // zero-copy sub-slice

    let mut reader = &data[..];
    let byte = reader.get_u8();
    let short = reader.get_u16();

    let mut original = Bytes::from_static(b"HEADER\x00PAYLOAD");
    let header = original.split_to(6);

    println!("{:?} {:?} {:?}", byte, short, slice);
    println!("{:?} {:?}", &header[..], &original[..]);
}

| Feature | `Vec<u8>` | `Bytes` |
|---|---|---|
| Clone cost | O(n) deep copy | O(1) refcount bump |
| Sub-slicing | Borrowed `&[u8]` tied to the Vec's lifetime | Owned slice sharing the same allocation |
| Thread safety | `Send + Sync`, but shared ownership needs an extra `Arc` | `Send + Sync`, shared ownership built in |
| Ecosystem fit | Standard library | tokio / hyper / tonic / axum |

When to use Bytes: It shines when one incoming buffer needs to be split, cloned, and handed to multiple components without copying the payload each time.
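The idea behind cheap clones and owned sub-slices can be modelled with the standard library alone. The `SharedSlice` type below is a toy, not the internals of `Bytes`: it pairs an `Arc<[u8]>` with a range, so cloning and slicing bump a reference count instead of copying the payload. All names in it are hypothetical.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Toy model of a reference-counted buffer view: shared storage plus a window.
#[derive(Clone, Debug)]
struct SharedSlice {
    buf: Arc<[u8]>,
    range: Range<usize>,
}

impl SharedSlice {
    /// Copies `data` once into shared storage.
    fn new(data: &[u8]) -> Self {
        SharedSlice { buf: Arc::from(data), range: 0..data.len() }
    }

    /// O(1): bumps the refcount and narrows the window; no bytes are copied.
    /// `sub` is relative to the current view.
    fn slice(&self, sub: Range<usize>) -> Self {
        assert!(
            sub.start <= sub.end && sub.end <= self.range.len(),
            "slice out of bounds"
        );
        let start = self.range.start + sub.start;
        SharedSlice {
            buf: Arc::clone(&self.buf),
            range: start..start + (sub.end - sub.start),
        }
    }

    /// The bytes visible through this view.
    fn as_bytes(&self) -> &[u8] {
        &self.buf[self.range.clone()]
    }

    /// Number of live views sharing the storage.
    fn holders(&self) -> usize {
        Arc::strong_count(&self.buf)
    }
}
```

Each view owns a share of the allocation, so a header and a payload view can be sent to different components and the storage is freed when the last view is dropped.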

Key Takeaways: Serialization & Binary Data

serde's derive macros cover the vast majority of everyday cases; attributes handle the remaining details

Zero-copy deserialization suits read-heavy workloads, provided the input buffer lives long enough

repr(C), zerocopy, and bytemuck handle low-level binary layouts; Bytes handles shared buffers

See also: Ch 10 (Error Handling) for integrating serde errors, and Ch 12 (Unsafe Rust) for repr(C) and low-level layout concerns.

flowchart LR
    subgraph Input["Input Formats"]
        JSON["JSON"]
        TOML["TOML"]
        Bin["bincode"]
        MsgP["MessagePack"]
    end

    subgraph serde["serde data model"]
        Ser["Serialize"]
        De["Deserialize"]
    end

    subgraph Output["Rust Types"]
        Struct["Rust struct"]
        Enum["Rust enum"]
    end

    JSON --> De
    TOML --> De
    Bin --> De
    MsgP --> De
    De --> Struct
    De --> Enum
    Struct --> Ser
    Enum --> Ser
    Ser --> JSON
    Ser --> Bin

    style JSON fill:#e8f4f8,stroke:#2980b9,color:#000
    style TOML fill:#e8f4f8,stroke:#2980b9,color:#000
    style Bin fill:#e8f4f8,stroke:#2980b9,color:#000
    style MsgP fill:#e8f4f8,stroke:#2980b9,color:#000
    style Ser fill:#fef9e7,stroke:#f1c40f,color:#000
    style De fill:#fef9e7,stroke:#f1c40f,color:#000
    style Struct fill:#d4efdf,stroke:#27ae60,color:#000
    style Enum fill:#d4efdf,stroke:#27ae60,color:#000

Exercise: Custom serde Deserialization ★★★ (~45 min)

Design a HumanDuration wrapper that deserializes from strings like "30s", "5m", "2h" and serializes back to the same style.

🔑 Solution

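use serde::{Deserialize, Deserializer, Serialize, Serializer};
use std::fmt;
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
struct HumanDuration(Duration);

impl HumanDuration {
    /// Parses strings such as "30s", "5m", "2h", or "250ms".
    fn from_str(s: &str) -> Result<Self, String> {
        let s = s.trim();
        if s.is_empty() { return Err("empty duration string".into()); }

        // Split at the first non-digit: "30s" -> ("30", "s").
        let (num_str, suffix) = s.split_at(
            s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len())
        );
        let value: u64 = num_str.parse()
            .map_err(|_| format!("invalid number: {num_str}"))?;
        // Guard the unit conversion against u64 overflow.
        let overflow = || format!("duration out of range: {s}");

        let duration = match suffix {
            "s" | "sec"  => Duration::from_secs(value),
            "m" | "min"  => Duration::from_secs(value.checked_mul(60).ok_or_else(overflow)?),
            "h" | "hr"   => Duration::from_secs(value.checked_mul(3600).ok_or_else(overflow)?),
            "ms"         => Duration::from_millis(value),
            other        => return Err(format!("unknown suffix: {other}")),
        };
        Ok(HumanDuration(duration))
    }
}

impl fmt::Display for HumanDuration {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let secs = self.0.as_secs();
        // Use milliseconds whenever there is a sub-second part,
        // so that 1500ms is not truncated to "1s".
        if secs == 0 || self.0.subsec_millis() != 0 {
            write!(f, "{}ms", self.0.as_millis())
        } else if secs % 3600 == 0 {
            write!(f, "{}h", secs / 3600)
        } else if secs % 60 == 0 {
            write!(f, "{}m", secs / 60)
        } else {
            write!(f, "{}s", secs)
        }
    }
}

// One straightforward way to wire the parser and Display into serde.
impl Serialize for HumanDuration {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        // Reuse Display so the output matches the accepted input style.
        serializer.collect_str(self)
    }
}

impl<'de> Deserialize<'de> for HumanDuration {
    fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
        let s = String::deserialize(deserializer)?;
        HumanDuration::from_str(&s).map_err(serde::de::Error::custom)
    }
}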
