# 11. Serialization, Zero-Copy, and Binary Data 🟡
What you'll learn:
- serde fundamentals: derive macros, attributes, and enum representations
- Zero-copy deserialization for high-performance read-heavy workloads
- The serde format ecosystem (JSON, TOML, bincode, MessagePack)
- Binary data handling with `repr(C)`, `zerocopy`, and `bytes::Bytes`
serde Fundamentals
serde (SERialize/DEserialize) is the de facto standard serialization framework for Rust. It separates the data model (your types) from the format (JSON, TOML, and so on):
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
struct ServerConfig {
    name: String,
    port: u16,
    // If the key is missing, fall back to usize::default(), i.e. 0.
    #[serde(default)]
    max_connections: usize,
    // Omit the key on output when the value is None.
    #[serde(skip_serializing_if = "Option::is_none")]
    tls_cert_path: Option<String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Deserialize from JSON (serde_json crate).
    let json_input = r#"{
        "name": "hw-diag",
        "port": 8080
    }"#;
    let config: ServerConfig = serde_json::from_str(json_input)?;
    println!("{config:?}");

    // Serialize back to pretty-printed JSON.
    let output = serde_json::to_string_pretty(&config)?;
    println!("{output}");

    // The same struct, unchanged, reads TOML (toml crate).
    let toml_input = r#"
        name = "hw-diag"
        port = 8080
    "#;
    let config: ServerConfig = toml::from_str(toml_input)?;
    println!("{config:?}");
    Ok(())
}
Key insight: Derive `Serialize` and `Deserialize` once, and the same struct works with every serde-compatible format.
Common serde Attributes
serde offers fine-grained control through container-level and field-level attributes:
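use serde::{Deserialize, Serialize};

// Container attributes apply to the whole type.
#[derive(Serialize, Deserialize)]
#[serde(rename_all = "camelCase")] // test_name <-> "testName"
#[serde(deny_unknown_fields)] // reject keys that are not fields
struct DiagResult {
    test_name: String,
    pass_count: u32,
    fail_count: u32,
}

// Field attributes apply to one field at a time.
#[derive(Serialize, Deserialize)]
struct Sensor {
    #[serde(rename = "sensor_id")] // serialized under the key "sensor_id"
    id: u64,
    #[serde(default)] // false when the key is missing
    enabled: bool,
    #[serde(default = "default_threshold")] // call this function when missing
    threshold: f64,
    #[serde(skip)] // never written; set to None when deserializing
    cached_value: Option<f64>,
    #[serde(skip_serializing_if = "Vec::is_empty")] // omit empty lists on output
    tags: Vec<String>,
    #[serde(flatten)] // vendor and model appear as top-level keys
    metadata: Metadata,
    // `hex_bytes` is not defined in this snippet. It must be a module in scope
    // that provides `serialize` and `deserialize` functions for Vec<u8>.
    #[serde(with = "hex_bytes")]
    raw_data: Vec<u8>,
}

fn default_threshold() -> f64 {
    1.0
}

#[derive(Serialize, Deserialize)]
struct Metadata {
    vendor: String,
    model: String,
}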
Most-used attributes cheat sheet:
| Attribute | Level | Effect |
|---|---|---|
| `rename_all = "camelCase"` | Container | Rename all fields to a target naming convention |
| `deny_unknown_fields` | Container | Error on unexpected keys |
| `default` | Field | Use `Default::default()` when missing |
| `rename = "..."` | Field | Custom serialized name |
| `skip` | Field | Exclude from ser/de entirely |
| `skip_serializing_if = "fn"` | Field | Conditionally skip on serialize |
| `flatten` | Field | Inline nested fields |
| `with = "module"` | Field | Use custom ser/de module |
| `alias = "..."` | Field | Accept alternative names when deserializing |
| `untagged` | Enum | Match enum variants by shape |
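The `with = "hex_bytes"` attribute in the `Sensor` example points at a module the snippet does not define. serde expects such a module to expose `serialize` and `deserialize` functions; the sketch below shows only the std-only conversion logic those functions could delegate to. The names `encode_hex` and `decode_hex` are hypothetical, and the serde glue itself is left out.

```rust
/// Encode bytes as lowercase hex, two characters per byte.
fn encode_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// Decode a hex string into bytes.
/// Rejects odd lengths and any character that is not an ASCII hex digit.
fn decode_hex(s: &str) -> Result<Vec<u8>, String> {
    if s.len() % 2 != 0 {
        return Err(format!("odd-length hex string: {} chars", s.len()));
    }
    if !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err("non-hex character in input".to_string());
    }
    // All bytes are ASCII here, so slicing at even offsets is always valid.
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).map_err(|e| e.to_string()))
        .collect()
}

fn main() {
    let raw = vec![0xde, 0xad, 0xbe, 0xef];
    let text = encode_hex(&raw);
    println!("{text}");
    assert_eq!(decode_hex(&text), Ok(raw));
}
```

A `hex_bytes::serialize` function would typically pass the encoded string to the serializer, and `hex_bytes::deserialize` would read a string and map a decode failure to a serde error.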
Enum Representations
serde provides four enum representations for self-describing formats such as JSON:
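use serde::{Deserialize, Serialize};

// 1. Externally tagged (the default).
//    "Reboot"
//    {"RunDiag": {"test_name": "mem", "timeout_secs": 30}}
//    {"SetFanSpeed": 80}
#[derive(Serialize, Deserialize)]
enum Command {
    Reboot,
    RunDiag { test_name: String, timeout_secs: u64 },
    SetFanSpeed(u8),
}

// 2. Internally tagged: the tag sits next to the variant's fields.
//    {"type": "Start", "timestamp": 1700000000}
#[derive(Serialize, Deserialize)]
#[serde(tag = "type")]
enum Event {
    Start { timestamp: u64 },
    Error { code: i32, message: String },
    End { timestamp: u64, success: bool },
}

// 3. Adjacently tagged: tag and content live under separate keys.
//    {"t": "Text", "c": "hello"}
#[derive(Serialize, Deserialize)]
#[serde(tag = "t", content = "c")]
enum Payload {
    Text(String),
    Binary(Vec<u8>),
}

// 4. Untagged: no tag at all; the variant is inferred from the shape.
//    "hello"  or  42.0
#[derive(Serialize, Deserialize)]
#[serde(untagged)]
enum StringOrNumber {
    Str(String),
    Num(f64),
}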
Which representation to choose: Internally tagged enums are usually the best default for JSON APIs. `untagged` is powerful, but serde tries the variants in declaration order and takes the first one that matches, so overlapping shapes become ambiguous quickly.
Zero-Copy Deserialization
serde can deserialize borrowed data directly from the input buffer, avoiding extra string allocations:
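use serde::Deserialize;

// Owned: every field is copied into a freshly allocated String.
#[derive(Deserialize)]
struct OwnedRecord {
    name: String,
    value: String,
}

// Borrowed: fields are &str slices that point into the input buffer.
// Deserialization fails if a string contains escapes (such as \n or \"),
// because the unescaped text does not exist anywhere in the input.
#[derive(Deserialize)]
struct BorrowedRecord<'a> {
    name: &'a str,
    value: &'a str,
}

fn main() {
    let input = r#"{"name": "cpu_temp", "value": "72.5"}"#;

    // Two allocations, but `owned` is independent of `input`.
    let owned: OwnedRecord = serde_json::from_str(input).unwrap();
    println!("{}: {}", owned.name, owned.value);

    // No string allocations, but `borrowed` cannot outlive `input`.
    let borrowed: BorrowedRecord = serde_json::from_str(input).unwrap();
    println!("{}: {}", borrowed.name, borrowed.value);
}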
When to use zero-copy:
- Parsing large files where only part of the data is used
- High-throughput pipelines such as packets or log streams
- The input buffer is guaranteed to live long enough
When not to use zero-copy:
- Input buffers are short-lived or will be reused immediately
- Results need to outlive the source buffer
- Fields need transformation, unescaping, or normalization
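The lifetime trade-off in these two lists can be seen without serde at all. The following std-only sketch parses a `key = value` line into borrowed slices, then converts to an owned copy once the result has to outlive the buffer. `BorrowedPair`, `OwnedPair`, `parse_pair`, and `to_owned_pair` are hypothetical names used only for illustration.

```rust
/// A view into the input line: no allocation, but tied to the buffer's lifetime.
#[derive(Debug, PartialEq)]
struct BorrowedPair<'a> {
    key: &'a str,
    value: &'a str,
}

/// An independent copy: two allocations, but free to outlive the buffer.
#[derive(Debug, PartialEq)]
struct OwnedPair {
    key: String,
    value: String,
}

/// Split on the first '=' and trim both sides; both fields borrow from `line`.
fn parse_pair(line: &str) -> Option<BorrowedPair<'_>> {
    let (key, value) = line.split_once('=')?;
    Some(BorrowedPair { key: key.trim(), value: value.trim() })
}

impl BorrowedPair<'_> {
    /// Copy the borrowed slices so the result no longer depends on the buffer.
    fn to_owned_pair(&self) -> OwnedPair {
        OwnedPair { key: self.key.to_string(), value: self.value.to_string() }
    }
}

fn main() {
    let kept: OwnedPair;
    {
        let buffer = String::from("cpu_temp = 72.5");
        let view = parse_pair(&buffer).expect("line contains '='");
        println!("{}: {}", view.key, view.value);
        // `view` cannot leave this block, so copy what must survive.
        kept = view.to_owned_pair();
    } // `buffer` is dropped here
    println!("{}: {}", kept.key, kept.value);
}
```

Assigning `view` itself to a variable declared outside the inner block would be rejected by the borrow checker, which is the compile-time form of "results need to outlive the source buffer".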
Practical tip: `Cow<'a, str>` is often the sweet spot: borrow when possible, allocate when necessary. With serde, annotate such a field with `#[serde(borrow)]`; without it, a `Cow` field is always deserialized as owned.
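To make "borrow when possible, allocate when necessary" concrete, here is a minimal std-only sketch. The function `normalize_name` and its normalization rule (lowercase, spaces to underscores) are assumptions chosen for illustration, not something the chapter prescribes.

```rust
use std::borrow::Cow;

/// Return the input unchanged (borrowed) when it is already normalized,
/// and allocate a new String only when something has to change.
fn normalize_name(raw: &str) -> Cow<'_, str> {
    let needs_change = raw.chars().any(|c| c.is_ascii_uppercase() || c == ' ');
    if needs_change {
        Cow::Owned(raw.to_ascii_lowercase().replace(' ', "_"))
    } else {
        Cow::Borrowed(raw)
    }
}

fn main() {
    for raw in ["cpu_temp", "CPU Temp"] {
        let name = normalize_name(raw);
        let kind = match &name {
            Cow::Borrowed(_) => "borrowed",
            Cow::Owned(_) => "owned",
        };
        println!("{raw:?} -> {name:?} ({kind})");
    }
}
```

Callers treat the result as a `&str` in both cases through deref, so the allocation decision stays an internal detail of the function.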
The Format Ecosystem
| Format | Crate | Human-Readable | Size | Speed | Use Case |
|---|---|---|---|---|---|
| JSON | `serde_json` | ✅ | Large | Good | Config, REST, logging |
| TOML | `toml` | ✅ | Medium | Good | Config files |
| YAML | `serde_yaml` | ✅ | Medium | Good | Nested config |
| bincode | `bincode` | ❌ | Small | Fast | Rust-to-Rust IPC, cache |
| postcard | `postcard` | ❌ | Tiny | Very fast | Embedded, `no_std` |
| MessagePack | `rmp-serde` | ❌ | Small | Fast | Cross-language binary protocol |
| CBOR | `ciborium` | ❌ | Small | Fast | IoT, constrained systems |
// One set of derives is enough for every format in the table above.
#[derive(serde::Serialize, serde::Deserialize, Debug)]
struct DiagConfig {
    name: String,
    tests: Vec<String>,
    timeout_secs: u64,
}
Choose your format: Human-edited config usually wants TOML or JSON. Rust-to-Rust binary traffic is a good fit for `bincode`. Cross-language binary protocols often prefer MessagePack or CBOR. Embedded systems lean toward `postcard`.
Binary Data and repr(C)
Low-level diagnostics often deal with binary protocols and hardware register layouts. Rust provides a few important tools for that job:
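// repr(C): fields are laid out in declaration order, as a C compiler would.
// All fields are u8, so the struct is exactly 6 bytes with no padding.
#[repr(C)]
#[derive(Debug, Clone, Copy)]
struct IpmiHeader {
    rs_addr: u8,
    net_fn_lun: u8,
    checksum: u8,
    rq_addr: u8,
    rq_seq_lun: u8,
    cmd: u8,
}

impl IpmiHeader {
    /// Safe manual parsing: check the length, then copy field by field.
    fn from_bytes(data: &[u8]) -> Option<Self> {
        if data.len() < std::mem::size_of::<Self>() {
            return None;
        }
        Some(IpmiHeader {
            rs_addr: data[0],
            net_fn_lun: data[1],
            checksum: data[2],
            rq_addr: data[3],
            rq_seq_lun: data[4],
            cmd: data[5],
        })
    }
}

// repr(C, packed): no padding and alignment 1, so `cap_reg` may sit at an
// unaligned address. Copy fields out by value; do not take references to them.
#[repr(C, packed)]
#[derive(Debug, Clone, Copy)]
struct PcieCapabilityHeader {
    cap_id: u8,
    next_cap: u8,
    cap_reg: u16,
}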
`repr(C)` lays fields out in declaration order using C's padding rules, so the layout is predictable. `repr(C, packed)` removes the padding, but fields may then be unaligned: copy them out by value rather than taking references to them.
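The difference is easy to observe with `size_of` and `align_of`. The sketch below uses two hypothetical structs, `Padded` and `Packed`, with identical fields; the sizes noted in the comments assume a mainstream target where `u32` is 4-byte aligned.

```rust
use std::mem::{align_of, size_of};

// repr(C): 1 byte tag + 3 bytes padding + 4 bytes value.
#[repr(C)]
#[derive(Clone, Copy)]
struct Padded {
    tag: u8,
    value: u32,
}

// repr(C, packed): 1 byte tag + 4 bytes value, alignment 1.
#[repr(C, packed)]
#[derive(Clone, Copy)]
struct Packed {
    tag: u8,
    value: u32,
}

/// Read a packed field by value. The compiler emits an unaligned load.
/// Writing `&p.value` instead would be rejected at compile time.
fn packed_value(p: &Packed) -> u32 {
    p.value
}

fn main() {
    println!("Padded: size {}, align {}", size_of::<Padded>(), align_of::<Padded>());
    println!("Packed: size {}, align {}", size_of::<Packed>(), align_of::<Packed>());

    let padded = Padded { tag: 1, value: 0xdead_beef };
    let packed = Packed { tag: padded.tag, value: padded.value };
    // Copy into a local first: println! takes references to its arguments.
    let value = packed_value(&packed);
    println!("value = {value:#x}");
}
```

Formatting macros borrow their arguments, so passing `packed.value` straight to `println!` hits the same restriction as an explicit reference; copying into a local sidesteps it.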
zerocopy and bytemuck: Safe Transmutation Helpers
Instead of relying on raw `unsafe` transmutes, these crates check the required layout invariants at compile time through their derive macros:
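// zerocopy 0.8 trait names (0.7 called `IntoBytes` `AsBytes`).
use zerocopy::{FromBytes, Immutable, IntoBytes, KnownLayout};

// The derives refuse to compile if the layout is unsuitable, for example
// if the struct had padding bytes. Here 2 + 1 + 1 + 4 = 8 bytes, no padding.
#[derive(FromBytes, IntoBytes, KnownLayout, Immutable, Debug)]
#[repr(C)]
struct SensorReading {
    sensor_id: u16,
    flags: u8,
    _reserved: u8,
    value: u32,
}

// bytemuck: `Pod` ("plain old data") requires Copy, a fixed layout,
// no padding, and that every bit pattern is a valid value.
// The derive macros need the crate's `derive` feature.
use bytemuck::{Pod, Zeroable};

#[derive(Pod, Zeroable, Clone, Copy, Debug)]
#[repr(C)]
struct GpuRegister {
    address: u32,
    value: u32,
}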
| Approach | Safety | Overhead | Use When |
|---|---|---|---|
| Manual parsing | ✅ | Copy fields | Small structs, odd layouts |
| `zerocopy` | ✅ | Zero-copy | Big buffers, strict layout checks |
| `bytemuck` | ✅ | Zero-copy | Simple `Pod` types |
| `unsafe` transmute | ❌ | Zero-copy | Last resort only |
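The "Manual parsing" row is the only approach that also lets you pick the byte order per field. As a hedged sketch, the code below parses the same three fields as `PcieCapabilityHeader` from a byte slice, assuming the 16-bit register is stored little-endian. `CapHeader` and `parse_cap_header` are hypothetical names, and the sample bytes are made up.

```rust
/// Plain (non-packed) struct holding the parsed fields.
#[derive(Debug, PartialEq)]
struct CapHeader {
    cap_id: u8,
    next_cap: u8,
    cap_reg: u16,
}

/// Parse the first four bytes of `data`; returns None if the slice is too short.
/// Assumption: `cap_reg` is little-endian on the wire.
fn parse_cap_header(data: &[u8]) -> Option<CapHeader> {
    // `get(..4)` does the bounds check; `try_into` turns the slice into a fixed array ref.
    let bytes: &[u8; 4] = data.get(..4)?.try_into().ok()?;
    Some(CapHeader {
        cap_id: bytes[0],
        next_cap: bytes[1],
        cap_reg: u16::from_le_bytes([bytes[2], bytes[3]]),
    })
}

fn main() {
    let raw = [0x10, 0x40, 0x02, 0x00, 0xff];
    match parse_cap_header(&raw) {
        Some(header) => println!("{header:?}"),
        None => println!("buffer too short"),
    }
}
```

Because the result is an ordinary struct with natural alignment, none of the `packed` hazards apply to it; the cost is copying four bytes.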
bytes::Bytes: Reference-Counted Buffers
The `bytes` crate is popular in async and network stacks because it supports cheap cloning and zero-copy slicing:
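use bytes::{Buf, BufMut, Bytes, BytesMut};

fn main() {
    // Build a buffer: 1 + 2 + 5 = 8 bytes.
    let mut buf = BytesMut::with_capacity(1024);
    buf.put_u8(0x01);
    buf.put_u16(0x1234); // written big-endian (network order)
    buf.put_slice(b"hello");

    // Freeze into an immutable, reference-counted handle.
    let data: Bytes = buf.freeze();
    let data2 = data.clone(); // cheap clone: shares the same allocation
    assert_eq!(data, data2);
    let slice = data.slice(3..8); // zero-copy sub-slice: b"hello"

    // Read through the Buf trait; the &[u8] cursor advances as it reads.
    let mut reader = &data[..];
    let byte = reader.get_u8(); // 0x01
    let short = reader.get_u16(); // 0x1234

    // split_to(6) returns the first 6 bytes and leaves the rest in `original`.
    let mut original = Bytes::from_static(b"HEADER\x00PAYLOAD");
    let header = original.split_to(6);

    println!("{:?} {:?} {:?}", byte, short, slice);
    println!("{:?} {:?}", &header[..], &original[..]);
}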
| Feature | `Vec<u8>` | `Bytes` |
|---|---|---|
| Clone cost | O(n) deep copy | O(1) refcount bump |
| Sub-slicing | Borrowed slice | Owned shared slice |
| Sharing across threads | Needs a wrapper such as `Arc` | Shareable as-is (`Send + Sync`) |
| Ecosystem fit | Standard library | tokio / hyper / tonic / axum |
When to use `Bytes`: It shines when one incoming buffer needs to be split, cloned, and handed to multiple components without copying the payload again and again.
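The idea behind cheap clones and owned sub-slices can be modeled with the standard library alone. The sketch below is not how the `bytes` crate is implemented; it is a simplified stand-in built on `Arc<[u8]>` plus a range, with the hypothetical name `SharedSlice`, meant only to show why cloning and slicing do not copy the payload.

```rust
use std::ops::Range;
use std::sync::Arc;

/// A shared, immutable byte buffer plus the window this handle can see.
#[derive(Clone, Debug)]
struct SharedSlice {
    buf: Arc<[u8]>,
    range: Range<usize>,
}

impl SharedSlice {
    /// Take ownership of the data once; later clones and slices share it.
    fn new(data: Vec<u8>) -> Self {
        let len = data.len();
        SharedSlice { buf: Arc::from(data), range: 0..len }
    }

    /// Owned sub-slice relative to this handle's window: bumps the refcount only.
    fn slice(&self, r: Range<usize>) -> SharedSlice {
        assert!(r.start <= r.end && r.end <= self.range.len(), "slice out of bounds");
        let start = self.range.start + r.start;
        SharedSlice {
            buf: Arc::clone(&self.buf),
            range: start..start + (r.end - r.start),
        }
    }

    fn as_bytes(&self) -> &[u8] {
        &self.buf[self.range.clone()]
    }
}

fn main() {
    let packet = SharedSlice::new(b"HEADER\x00PAYLOAD".to_vec());
    let header = packet.slice(0..6);
    let payload = packet.slice(7..14);
    println!("{:?} {:?}", header.as_bytes(), payload.as_bytes());
    // Three handles, one allocation.
    println!("handles sharing the buffer: {}", Arc::strong_count(&packet.buf));
}
```

Each handle owns a share of the buffer, so `header` and `payload` can be moved to different components or threads independently, which is the property the `Bytes` column in the table describes.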
Key Takeaways: Serialization & Binary Data
- serde's derive macros cover the vast majority of everyday cases; attributes fine-tune the remaining details
- Zero-copy deserialization suits read-heavy workloads, provided the input buffer lives long enough
- `repr(C)`, `zerocopy`, and `bytemuck` handle low-level binary layouts; `Bytes` handles shared buffers
See also: Ch 10 (Error Handling) for integrating serde errors into your error types, and Ch 12 (Unsafe Rust) for `repr(C)` and low-level layout concerns.
flowchart LR
subgraph Input["Input Formats"]
JSON["JSON"]
TOML["TOML"]
Bin["bincode"]
MsgP["MessagePack"]
end
subgraph serde["serde data model"]
Ser["Serialize"]
De["Deserialize"]
end
subgraph Output["Rust Types"]
Struct["Rust struct"]
Enum["Rust enum"]
end
JSON --> De
TOML --> De
Bin --> De
MsgP --> De
De --> Struct
De --> Enum
Struct --> Ser
Enum --> Ser
Ser --> JSON
Ser --> Bin
style JSON fill:#e8f4f8,stroke:#2980b9,color:#000
style TOML fill:#e8f4f8,stroke:#2980b9,color:#000
style Bin fill:#e8f4f8,stroke:#2980b9,color:#000
style MsgP fill:#e8f4f8,stroke:#2980b9,color:#000
style Ser fill:#fef9e7,stroke:#f1c40f,color:#000
style De fill:#fef9e7,stroke:#f1c40f,color:#000
style Struct fill:#d4efdf,stroke:#27ae60,color:#000
style Enum fill:#d4efdf,stroke:#27ae60,color:#000
Exercise: Custom serde Deserialization ★★★ (~45 min)
Design a `HumanDuration` wrapper that deserializes from strings like "30s", "5m", "2h" and serializes back to the same style.
🔑 Solution
use serde::{Deserialize, Deserializer, Serialize, Serializer};
use std::fmt;

#[derive(Debug, Clone, PartialEq)]
struct HumanDuration(std::time::Duration);

impl HumanDuration {
    /// Parse "<digits><suffix>", e.g. "30s", "5m", "2h", "250ms".
    fn from_str(s: &str) -> Result<Self, String> {
        let s = s.trim();
        if s.is_empty() {
            return Err("empty duration string".into());
        }
        // Split at the first non-digit: "30s" -> ("30", "s").
        let (num_str, suffix) = s.split_at(
            s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len()),
        );
        let value: u64 = num_str
            .parse()
            .map_err(|_| format!("invalid number: {num_str}"))?;
        let duration = match suffix {
            "s" | "sec" => std::time::Duration::from_secs(value),
            "m" | "min" => std::time::Duration::from_secs(value * 60),
            "h" | "hr" => std::time::Duration::from_secs(value * 3600),
            "ms" => std::time::Duration::from_millis(value),
            other => return Err(format!("unknown suffix: {other}")),
        };
        Ok(HumanDuration(duration))
    }
}

impl fmt::Display for HumanDuration {
    /// Print with the largest unit that represents the value exactly.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let secs = self.0.as_secs();
        let millis = self.0.as_millis();
        // Fall back to milliseconds for sub-second values and for values
        // like 1500ms, so that formatting and parsing round-trip.
        if secs == 0 || millis % 1000 != 0 {
            write!(f, "{}ms", millis)
        } else if secs % 3600 == 0 {
            write!(f, "{}h", secs / 3600)
        } else if secs % 60 == 0 {
            write!(f, "{}m", secs / 60)
        } else {
            write!(f, "{}s", secs)
        }
    }
}