Federation (Distributed Mode)
Status: All phases (1–10) implemented. Conflict resolution (HLC+LWW), anti-entropy (Merkle trees), distributed tombstone GC, bit rot protection, and cluster operations are complete.
S4 supports a leaderless distributed mode where multiple nodes form a cluster and replicate data using quorum protocols. This provides high availability and horizontal scaling while maintaining S3 API compatibility.
Key Concepts
Leaderless Replication
Every node in an S4 cluster is equal — there is no single leader or master. Any node can accept any S3 request and act as the coordinator for that operation. The coordinator fans out reads and writes to the appropriate replicas and collects quorum responses.
Server Pools
A cluster is composed of one or more server pools. Each pool is a fixed set of nodes that replicate data among themselves.
Cluster
+-- Pool 1: [Node A, Node B, Node C] (RF=3, all 3 nodes store all data)
+-- Pool 2: [Node D, Node E, Node F] (added later for capacity)
Rules: - Pool membership is immutable after creation — nodes are never added or removed from existing pools - Horizontal scaling = adding new pools (not new nodes to existing pools) - Each bucket is pinned to one pool at creation time; all objects in that bucket live in that pool - New buckets are created in the pool with the most free space
Quorum Parameters
Default configuration: N=3, W=2, R=2 (replication factor 3, write quorum 2, read quorum 2).
| Parameter | Default | Meaning |
|---|---|---|
| N (RF) | 3 | Each object is stored on 3 nodes |
| W | 2 | A write succeeds when 2 of 3 replicas confirm |
| R | 2 | A read succeeds when 2 of 3 replicas respond |
This guarantees: - Read-your-writes: W + R = 4 > N = 3, so at least one replica in every read quorum has the latest write - Split-brain safety: 2W = 4 > N = 3, so two partitions cannot both accept writes - Fault tolerance: Cluster continues to serve reads and writes with 1 node down
Node Identity
Each node gets a unique UUID (persisted in data_dir/node_id) on first startup. This ID never changes across restarts, ensuring the node retains ownership of its data.
Failure Detection (SWIM Gossip)
Nodes discover each other via seed addresses and use the SWIM protocol (via the foca crate) for failure detection:
- Alive → Suspect: Node fails to respond to ping (5 s timeout)
- Suspect → Dead: Node fails indirect probes through K=3 other nodes (30 s timeout)
- Metadata propagation: Each node broadcasts its capacity, version, and pool membership via gossip
Inter-Node Communication (gRPC)
Nodes communicate over gRPC (default port: 9100) for data and metadata replication:
- WriteBlob / ReadBlob — replicate object data between nodes
- WriteMetadata / ReadMetadata — replicate metadata records
- HeadObject — lightweight digest for quorum comparison
- Streaming — large objects (>8 MB) use gRPC streaming
- TLS — optional TLS for inter-node traffic
Data Placement
For the recommended deployment (pool size = RF = 3), every node in the pool stores 100% of the pool's data. No hash ring is needed — all operations broadcast to all pool nodes.
For larger pools (pool size > RF), a consistent hash ring with 128 virtual nodes per physical node routes keys to the correct RF replicas.
Hinted Handoff
When a replica is temporarily unreachable during a write: 1. The coordinator stores a hint locally (persisted to disk) 2. When the replica comes back online (detected via gossip), hints are delivered 3. Hints have a TTL (default: 3 hours) — after expiry, anti-entropy handles catch-up
Read Repair
When a quorum read detects that replicas have different versions of an object, the coordinator asynchronously sends the latest version to stale replicas. This is transparent to the client and does not add latency to reads.
Write Path (Distributed)
Client PUT /bucket/key
|
v
Coordinator (any node)
|
+-- 1. Determine pool (from bucket → pool mapping)
+-- 2. Determine replicas (all nodes in pool for RF=pool_size)
+-- 3. Generate HLC timestamp + operation_id
+-- 4. Send WriteBlob to ALL replicas in parallel
+-- 5. Wait for W=2 successful ACKs
| (each ACK means: blob written + metadata stored + fsync done)
+-- 6. Return 200 OK to client
|
+-- If <W ACKs: return 503, store hints for offline replicas
Read Path (Distributed)
Client GET /bucket/key
|
v
Coordinator (any node)
|
+-- 1. Send HeadObject (digest) to ALL replicas in parallel
+-- 2. Wait for R=2 digest responses
+-- 3. Compare HLC timestamps
|
+-- If versions match: fetch data from closest replica, return to client
+-- If versions differ: fetch data from newest replica, return to client
| async: trigger read repair on stale replica(s)
|
+-- If <R responses: return 503
Conflict Resolution (HLC + LWW)
S4 uses Hybrid Logical Clocks (HLC) for causal ordering and Last-Writer-Wins (LWW) for conflict resolution.
- Each write generates an HLC timestamp:
(wall_time, logical_counter, node_id) - Concurrent writes to the same key are resolved deterministically: highest HLC wins, with
node_idas tiebreaker - All nodes arrive at the same winner for any conflict — no coordination needed
- Clock skew > 500ms is logged as a warning
- Clients can use
If-Match/If-None-Matchfor conditional writes
Anti-Entropy (Merkle Trees)
Background service that detects and repairs divergences between replicas every 10 minutes.
How it works: 1. Each node maintains a persisted Merkle tree (BLAKE3 hashes, stored in fjall) over its key ranges 2. Every 10 minutes, each node exchanges Merkle root hashes with a random replica 3. If roots differ, the tree is traversed to find divergent keys in O(log K) comparisons 4. Divergent keys are synced to the latest version 5. Repair frontier tracks per-replica reconciliation progress — used for safe tombstone purge
Distributed Tombstones & GC
Deletes create tombstones that must live long enough for all replicas to see them.
- GC grace period: 7 days (configurable via
S4_GC_GRACE_DAYS) - Tombstone purge requires:
age > gc_graceAND repair frontier confirms all replicas have synced - Zombie resurrection protection: If a node is offline >
max_rejoin_downtime(3 days), it must full-bootstrap instead of incremental repair - Deduplication: Remains per-node (no cross-node dedup coordination). Mark-sweep GC runs every 24 hours instead of ref-count decrement
Bit Rot Protection (Data Scrubber)
Background scrubber verifies data integrity and auto-heals from replicas.
- CRC32 scan: All volumes scanned over a configurable period (default: 30 days)
- Auto-heal: Corrupted blobs are fetched from a healthy replica and replaced locally
- Deep verify: SHA-256 content hash verification available via admin API (runs every 90 days or on demand)
- I/O throttling: Scrubber uses low-priority I/O to avoid impacting normal operations
- Metrics:
blobs_scanned_total,corruptions_found_total,corruptions_healed_total,scan_progress
Cluster Operations
Deployment Modes
| Mode | S4_MODE |
Description |
|---|---|---|
| Single-node | single (default) |
Standard standalone server, no cluster overhead |
| Cluster | cluster |
Storage node with full replication |
| Gateway | gateway |
Stateless router — no local storage, forwards to cluster nodes |
Graceful Shutdown
- Stop accepting new coordinated requests
- Wait for in-flight operations (timeout: 30s)
- Flush pending hints
- Broadcast
NodeStatus::Leftvia gossip - Sync fjall, close volumes, exit
Rolling Upgrades
- Pre-check: all nodes alive, anti-entropy caught up, no pending hints
- Drain one node → upgrade binary → restart → health check
- Validate: PUT/GET/DELETE test, inter-node communication, anti-entropy
- Repeat for remaining nodes one at a time
- Protocol versioning ensures backward compatibility within minor versions
Admin API (Cluster)
| Endpoint | Method | Description |
|---|---|---|
/admin/cluster/health |
GET | Cluster health, node statuses, epoch |
/admin/cluster/topology |
GET | Pools, hash rings, node assignments |
/admin/node/health |
GET | Individual node status and version |
/admin/cluster/pool |
POST | Add a new pool (no downtime) |
/admin/node/drain |
POST | Graceful drain for rolling upgrade |
/admin/node/bootstrap |
POST | Full bootstrap for nodes offline > max_rejoin_downtime |
/admin/cluster/scrub |
POST | Trigger manual data scrub |
/admin/cluster/repair-status |
GET | Repair frontier, divergent keys, last repair time |
Cluster Configuration (Environment Variables)
| Variable | Description | Default |
|---|---|---|
S4_MODE |
Operating mode: single, cluster, gateway |
single |
S4_CLUSTER_NAME |
Cluster name for network isolation | default |
S4_NODE_ID |
Human-readable node name (UUID auto-generated internally) | Auto |
S4_NODE_GRPC_ADDR |
gRPC listen address for inter-node communication | — |
S4_NODE_HTTP_ADDR |
HTTP address advertised to other nodes | — |
S4_SEEDS |
Comma-separated seed node gRPC addresses | — |
S4_POOL_NAME |
Pool this node belongs to | — |
S4_POOL_NODES |
Pool members: id:addr,id:addr,... |
— |
S4_REPLICATION_FACTOR |
Replication factor (N) | 3 |
S4_WRITE_QUORUM |
Write quorum (W) | 2 |
S4_READ_QUORUM |
Read quorum (R) | 2 |
S4_GC_GRACE_DAYS |
Tombstone GC grace period (days) | 7 |
S4_MAX_REJOIN_DOWNTIME_DAYS |
Max offline days before full bootstrap required | 3 |
S4_ANTI_ENTROPY_INTERVAL_SECS |
Anti-entropy Merkle exchange interval | 600 |
S4_SCRUBBER_FULL_SCAN_DAYS |
Full CRC32 scrub cycle (days) | 30 |
S4_HINT_TTL_HOURS |
Hinted handoff TTL | 3 |
Docker Compose (3-Node Cluster)
services:
s4-node1:
image: s4core:latest
environment:
S4_MODE: cluster
S4_CLUSTER_NAME: dev
S4_NODE_ID: node-1
S4_NODE_GRPC_ADDR: s4-node1:9100
S4_NODE_HTTP_ADDR: s4-node1:9000
S4_SEEDS: s4-node1:9100,s4-node2:9100,s4-node3:9100
S4_POOL_NAME: pool-1
S4_POOL_NODES: "node-1:s4-node1:9100,node-2:s4-node2:9100,node-3:s4-node3:9100"
S4_DATA_DIR: /data
S4_ACCESS_KEY_ID: minioadmin
S4_SECRET_ACCESS_KEY: minioadmin
volumes:
- s4-data-1:/data
ports:
- "9001:9000"
s4-node2:
image: s4core:latest
environment:
S4_MODE: cluster
S4_NODE_ID: node-2
S4_NODE_GRPC_ADDR: s4-node2:9100
S4_NODE_HTTP_ADDR: s4-node2:9000
S4_SEEDS: s4-node1:9100,s4-node2:9100,s4-node3:9100
S4_POOL_NAME: pool-1
S4_POOL_NODES: "node-1:s4-node1:9100,node-2:s4-node2:9100,node-3:s4-node3:9100"
S4_DATA_DIR: /data
S4_ACCESS_KEY_ID: minioadmin
S4_SECRET_ACCESS_KEY: minioadmin
volumes:
- s4-data-2:/data
s4-node3:
image: s4core:latest
environment:
S4_MODE: cluster
S4_NODE_ID: node-3
S4_NODE_GRPC_ADDR: s4-node3:9100
S4_NODE_HTTP_ADDR: s4-node3:9000
S4_SEEDS: s4-node1:9100,s4-node2:9100,s4-node3:9100
S4_POOL_NAME: pool-1
S4_POOL_NODES: "node-1:s4-node1:9100,node-2:s4-node2:9100,node-3:s4-node3:9100"
S4_DATA_DIR: /data
S4_ACCESS_KEY_ID: minioadmin
S4_SECRET_ACCESS_KEY: minioadmin
volumes:
- s4-data-3:/data
volumes:
s4-data-1:
s4-data-2:
s4-data-3:
CE vs EE
| Feature | Community Edition | Enterprise Edition |
|---|---|---|
| Pools | 1 pool | Unlimited |
| Nodes per pool | 3 max | Unlimited |
| Gossip & quorum | Full | Full |
| Audit logging | No | Yes |
| Rolling upgrades | No | Yes |
| Deep scrub | No | Yes |
| Dead node replacement | No | Yes |