An introduction to content-defined chunking: why fixed-size splitting fails, how content-aware boundaries solve the deduplication problem, and a taxonomy of three CDC algorithm families.
An exploration of FastCDC's Gear hash, normalized chunking with dual masks, and the 2020 two-byte-per-iteration optimization, with code in pseudocode, Rust, and TypeScript.
See CDC-based deduplication in action, learn where CDC is deployed today, and explore the frontier of structure-aware chunking for source code.
CDC chunks are the right logical unit for deduplication, but storing them as individual objects is prohibitively expensive. This post explores containers, the storage abstraction that makes CDC viable at scale, and the fragmentation, garbage collection, and restore challenges they introduce.
Cloud object storage can be expensive for CDC at scale. This post explores cost-saving alternatives: challenger storage providers with radically different pricing, and the role caching plays under Zipf access patterns to drive costs down further.