Dynamic Role Transitions: Building a Self-Healing Distributed System in Go
February 2, 2026
In the previous post, we implemented leader election and service discovery using ZooKeeper. But knowing the leader's identity is only half the battle — we also need to act on that information.
Today we'll build the OnElectionAction component that handles role transitions, turning our nodes into a self-healing cluster that automatically adapts when nodes join, leave, or fail.
The Challenge
When a node's role changes, several things need to happen:
Becoming a Leader (Coordinator):
- Stop serving as a worker
- Unregister from the workers registry
- Start watching for worker changes
- Start the coordinator server
- Register as a coordinator
Becoming a Worker:
- Stop serving as coordinator (if applicable)
- Unregister from coordinators registry
- Start the worker server
- Register as a worker
All of this needs to happen atomically and safely, even under concurrent access.
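The steps above are expressed through the ServiceRegistry abstraction from the previous post. Its full definition isn't repeated here; the following is only a minimal sketch covering the methods this post calls, with signatures inferred from the call sites, so treat it as an approximation of the real interface in the cluster package.

// ServiceRegistry, trimmed to the methods used in this post (sketch).
type ServiceRegistry interface {
	RegisterToCluster(address string) error // create our ephemeral znode holding this address
	UnregisterFromCluster() error           // delete our znode, if it exists
	RegisterForUpdates()                    // watch the registry znode for membership changes
}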
The State Machine
Our node can be in one of three states:
┌──────────┐   OnElectedToBeLeader()   ┌──────────┐
│   NONE   │ ──────────────────────────▶│  LEADER  │
└──────────┘                            └──────────┘
      │                                       │
      │ OnWorker()                            │ OnWorker()
      ▼                                       ▼
┌──────────┐   OnElectedToBeLeader()   ┌──────────┐
│  WORKER  │ ──────────────────────────▶│  LEADER  │
└──────────┘                            └──────────┘
Implementation
The OnElectionAction Struct
package app
type OnElectionAction struct {
	workersRegistry      cluster.ServiceRegistry
	coordinatorsRegistry cluster.ServiceRegistry
	port                 int
	documentsDir         string
	webServer            *networking.WebServer
	currentRole          string
	mu                   sync.Mutex
}
const (
	RoleLeader = "leader"
	RoleWorker = "worker"
	RoleNone   = "none"
)
The mutex is crucial — role transitions can be triggered by ZooKeeper watches, which run in separate goroutines.
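The tests later in the post construct the component with NewOnElectionAction, which isn't shown elsewhere. A plausible sketch, assuming it simply wires the four dependencies into the struct above:

// NewOnElectionAction wires the dependencies together; the node starts with no
// role until the first election callback fires. (Sketch: the real constructor
// may differ, but the tests below call it with exactly these four arguments.)
func NewOnElectionAction(
	workersRegistry cluster.ServiceRegistry,
	coordinatorsRegistry cluster.ServiceRegistry,
	port int,
	documentsDir string,
) *OnElectionAction {
	return &OnElectionAction{
		workersRegistry:      workersRegistry,
		coordinatorsRegistry: coordinatorsRegistry,
		port:                 port,
		documentsDir:         documentsDir,
		currentRole:          RoleNone,
	}
}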
Transitioning to Leader
When elected as leader, we need to carefully tear down the worker infrastructure and build up the coordinator:
func (oea *OnElectionAction) OnElectedToBeLeader() {
	oea.mu.Lock()
	defer oea.mu.Unlock()

	log.Printf("Transitioning to LEADER role on port %d", oea.port)

	// Step 1: Unregister from workers registry
	if err := oea.workersRegistry.UnregisterFromCluster(); err != nil {
		log.Printf("Warning: Failed to unregister from workers: %v", err)
	}

	// Step 2: Start watching for worker changes
	oea.workersRegistry.RegisterForUpdates()

	// Step 3: Stop existing worker server if running
	if oea.webServer != nil && oea.webServer.IsRunning() {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		oea.webServer.Stop(ctx)
		oea.webServer = nil
	}

	// Step 4: Create and start coordinator server
	client := networking.NewWebClient()
	coordinator := search.NewSearchCoordinator(
		oea.workersRegistry,
		client,
		oea.documentsDir,
	)
	oea.webServer = networking.NewWebServer(oea.port, coordinator)
	oea.webServer.StartAsync()

	// Step 5: Register as coordinator
	address := fmt.Sprintf("http://localhost:%d/search", oea.port)
	oea.coordinatorsRegistry.RegisterToCluster(address)

	oea.currentRole = RoleLeader
	log.Printf("Successfully transitioned to LEADER role")
}
Transitioning to Worker
The worker transition is similar but with an important optimization — if we're already a worker, we don't restart the server:
func (oea *OnElectionAction) OnWorker() {
	oea.mu.Lock()
	defer oea.mu.Unlock()

	log.Printf("Transitioning to WORKER role on port %d", oea.port)

	// Optimization: If already a worker, just ensure registration
	if oea.currentRole == RoleWorker && oea.webServer != nil && oea.webServer.IsRunning() {
		log.Printf("Already running as worker, ensuring registration")
		oea.ensureWorkerRegistration()
		return
	}

	// Stop any existing server
	if oea.webServer != nil && oea.webServer.IsRunning() {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		oea.webServer.Stop(ctx)
		oea.webServer = nil
	}

	// Unregister from coordinators if we were a leader
	oea.coordinatorsRegistry.UnregisterFromCluster()

	// Create and start worker server
	worker := search.NewSearchWorker()
	oea.webServer = networking.NewWebServer(oea.port, worker)
	oea.webServer.StartAsync()

	// Register as worker
	oea.ensureWorkerRegistration()

	oea.currentRole = RoleWorker
	log.Printf("Successfully transitioned to WORKER role")
}
Graceful Shutdown
When the application terminates, we need to clean up properly:
func (oea *OnElectionAction) Stop() error {
	oea.mu.Lock()
	defer oea.mu.Unlock()

	log.Printf("Stopping OnElectionAction")

	// Stop the web server
	if oea.webServer != nil && oea.webServer.IsRunning() {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		oea.webServer.Stop(ctx)
	}

	// Unregister from both registries
	oea.workersRegistry.UnregisterFromCluster()
	oea.coordinatorsRegistry.UnregisterFromCluster()

	oea.currentRole = RoleNone
	return nil
}
Testing Role Transitions
Testing distributed systems is notoriously difficult. We use a mock service registry to test the role transition logic in isolation:
func TestOnElectionAction_OnElectedToBeLeader(t *testing.T) {
	workersRegistry := cluster.NewMockServiceRegistry()
	coordinatorsRegistry := cluster.NewMockServiceRegistry()

	action := NewOnElectionAction(
		workersRegistry,
		coordinatorsRegistry,
		19082,
		"",
	)

	// First become a worker
	action.OnWorker()
	time.Sleep(200 * time.Millisecond)

	// Verify worker state
	if action.GetCurrentRole() != RoleWorker {
		t.Errorf("expected RoleWorker, got %s", action.GetCurrentRole())
	}

	// Now transition to leader
	action.OnElectedToBeLeader()
	time.Sleep(200 * time.Millisecond)

	// Verify leader state
	if action.GetCurrentRole() != RoleLeader {
		t.Errorf("expected RoleLeader, got %s", action.GetCurrentRole())
	}

	// Verify unregistered from workers
	if workersRegistry.IsRegistered() {
		t.Error("expected to be unregistered from workers")
	}

	// Verify registered as coordinator
	if !coordinatorsRegistry.IsRegistered() {
		t.Error("expected to be registered as coordinator")
	}

	action.Stop()
}
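The mock registry the tests construct with cluster.NewMockServiceRegistry isn't shown in this post. Conceptually it just records calls in memory instead of touching ZooKeeper; here is a rough sketch, assuming only the method names the tests rely on (everything else is illustrative):

// MockServiceRegistry records registry calls in memory (sketch).
type MockServiceRegistry struct {
	mu            sync.Mutex
	registered    bool
	registerCount int
}

func NewMockServiceRegistry() *MockServiceRegistry {
	return &MockServiceRegistry{}
}

func (m *MockServiceRegistry) RegisterToCluster(address string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.registered {
		return nil // already registered: don't bump the count (duplicate prevention)
	}
	m.registered = true
	m.registerCount++
	return nil
}

func (m *MockServiceRegistry) UnregisterFromCluster() error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.registered = false
	return nil
}

func (m *MockServiceRegistry) RegisterForUpdates() {}

func (m *MockServiceRegistry) IsRegistered() bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.registered
}

func (m *MockServiceRegistry) GetRegisterCount() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.registerCount
}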
Testing Idempotency
An important property is that calling OnWorker() multiple times shouldn't restart the server:
func TestOnElectionAction_WorkerStartsServerOnlyOnce(t *testing.T) {
	workersRegistry := cluster.NewMockServiceRegistry()
	coordinatorsRegistry := cluster.NewMockServiceRegistry()

	action := NewOnElectionAction(
		workersRegistry,
		coordinatorsRegistry,
		19084,
		"",
	)

	// First call to OnWorker
	action.OnWorker()
	time.Sleep(200 * time.Millisecond)

	firstServer := action.GetWebServer()
	firstRegisterCount := workersRegistry.GetRegisterCount()

	// Second call to OnWorker
	action.OnWorker()
	time.Sleep(100 * time.Millisecond)

	secondServer := action.GetWebServer()

	// Server should be the same instance
	if firstServer != secondServer {
		t.Error("expected server to remain the same")
	}

	// Register count should not increase (duplicate prevention)
	if workersRegistry.GetRegisterCount() != firstRegisterCount {
		t.Error("expected no additional registrations")
	}

	action.Stop()
}
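Both tests also lean on two read-only accessors, GetCurrentRole and GetWebServer, which aren't shown elsewhere. A minimal sketch; note that they take the same mutex, so a read can't interleave with a transition in progress:

func (oea *OnElectionAction) GetCurrentRole() string {
	oea.mu.Lock()
	defer oea.mu.Unlock()
	return oea.currentRole
}

func (oea *OnElectionAction) GetWebServer() *networking.WebServer {
	oea.mu.Lock()
	defer oea.mu.Unlock()
	return oea.webServer
}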
The Complete Flow
Here's what happens when a 3-node cluster starts up:
Time   Node A                     Node B               Node C
────────────────────────────────────────────────────────────────────
t=0    Connect to ZK              Connect to ZK        Connect to ZK
t=1    Create c_001               Create c_002         Create c_003
t=2    I'm smallest!              Watch c_001          Watch c_002
       → OnElectedToBeLeader()
       → Start coordinator
       → Register /search
t=3                               → OnWorker()         → OnWorker()
                                  → Start worker       → Start worker
                                  → Register /task     → Register /task
Now if Node A crashes:
Time   Node A       Node B                       Node C
────────────────────────────────────────────────────────────────────
t=10   💥 CRASH     Watch fires!                 (still watching B)
                    c_001 deleted
t=11                Re-elect... I'm smallest!
                    → OnElectedToBeLeader()
                    → Stop worker
                    → Unregister /task
                    → Start coordinator
                    → Register /search
t=12                                             Watch fires!
                                                 c_002 deleted
                                                 Re-elect...
                                                 Watch c_002 (new B)
                                                 → OnWorker() (no-op)
The cluster heals itself automatically!
Key Design Decisions
1. Mutex for Thread Safety
ZooKeeper watches fire in separate goroutines. Without the mutex, we could have race conditions during transitions.
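A cheap way to gain confidence in the locking is to run the test suite under Go's race detector, which flags unsynchronized access during the transitions:

go test -race ./internal/app/... -v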
2. Graceful Server Shutdown
We use context.WithTimeout to ensure servers shut down cleanly, allowing in-flight requests to complete.
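The WebServer type lives in the networking package and isn't reproduced in this post; its Stop method presumably wraps net/http's graceful Shutdown. A sketch under that assumption (the server field name is illustrative):

// Sketch: assumes WebServer wraps an *http.Server internally.
func (ws *WebServer) Stop(ctx context.Context) error {
	// Shutdown stops accepting new connections and waits for in-flight requests
	// to finish, giving up when ctx's deadline (5s in the callers above) expires.
	return ws.server.Shutdown(ctx)
}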
3. Idempotent Worker Transition
Calling OnWorker() when already a worker is a no-op. This prevents unnecessary server restarts during re-elections.
4. Separate Registries
Workers and coordinators have separate registries. This allows the coordinator to discover workers without seeing itself.
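In practice that just means constructing the same registry implementation twice, pointed at different parent znodes. A hypothetical wiring sketch; the constructor name, zkConn variable, and znode paths are placeholders, the real ones come from the previous post:

// Hypothetical wiring: same registry type, different parent znodes, so a
// coordinator never "discovers" itself when it lists the workers.
workersRegistry := cluster.NewServiceRegistry(zkConn, "/workers")
coordinatorsRegistry := cluster.NewServiceRegistry(zkConn, "/coordinators")
action := NewOnElectionAction(workersRegistry, coordinatorsRegistry, port, documentsDir)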
What's Next?
In the final post, we'll wire everything together in the main application:
- Command-line argument parsing
- Signal handling for graceful shutdown
- The complete startup sequence
Get the Code
git clone git@github.com:UnplugCharger/distributed_doc_search.git
cd distributed-search-cluster-go
git checkout 06-role-transitions
go test ./internal/app/... -v
This post is part of the "Distributed Document Search" series. Follow along as we build a production-ready search cluster from scratch.