Content Deduplication
S4 automatically deduplicates stored data using content-addressable storage (CAS). When two or more objects have identical content, only one copy is stored on disk.
How It Works
- When an object is uploaded, S4 computes its SHA-256 hash
- The hash is checked against the deduplication index
- If the hash already exists, the object data is not written again — only a new metadata reference is created
- A reference counter tracks how many objects point to each unique blob
- When all references to a blob are deleted, the space becomes reclaimable by the compactor
Storage Savings
Deduplication provides 30-50% space savings on typical workloads. Savings depend on how much duplicate content exists in your data.
Examples of high-dedup workloads: - Backup storage (incremental backups share most data) - Container image registries (layers are often shared) - Log storage (repeated patterns) - File sync services (multiple users uploading the same files)
Dedup Statistics
Check deduplication effectiveness via the stats API:
curl http://localhost:9000/api/stats
Response includes:
{
"dedup_unique_blobs": 1500,
"dedup_total_references": 3200,
"dedup_ratio": 2.13
}
| Field | Description |
|---|---|
dedup_unique_blobs |
Number of unique data blobs on disk |
dedup_total_references |
Total number of object references |
dedup_ratio |
Ratio of references to unique blobs (higher = more savings) |
A ratio of 2.13 means that on average, each unique blob is referenced by 2.13 objects — roughly 53% storage savings.
Interaction with Other Features
- Versioning: Each version is deduplicated independently. If version 1 and version 3 have identical content, only one copy is stored.
- Object Lock: Deduplication is transparent to Object Lock. Locked objects are protected regardless of dedup status.
- Lifecycle Policies: When lifecycle rules delete objects, dedup reference counts are decremented. The actual data is removed only when no references remain.
Design Details
- Hashing algorithm: SHA-256 (cryptographically strong, no collisions in practice)
- Dedup granularity: whole-object (not block-level)
- All objects are deduplicated regardless of size
- Dedup index is stored in a dedicated fjall keyspace alongside other metadata