Dynamic Role Transitions: Building a Self-Healing Distributed System in Go

February 2, 2026

In the previous post, we implemented leader election and service discovery using ZooKeeper. But identifying the leader is only half the battle — we also need to act on that information.

Today we'll build the OnElectionAction component that handles role transitions, turning our nodes into a self-healing cluster that automatically adapts when nodes join, leave, or fail.

The Challenge

When a node's role changes, several things need to happen:

Becoming a Leader (Coordinator):

  1. Stop serving as a worker
  2. Unregister from the workers registry
  3. Start watching for worker changes
  4. Start the coordinator server
  5. Register as a coordinator

Becoming a Worker:

  1. Stop serving as coordinator (if applicable)
  2. Unregister from coordinators registry
  3. Start the worker server
  4. Register as a worker

All of this needs to happen atomically and safely, even under concurrent access.
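
Both transitions are driven by callbacks from the leader election component we built in the previous post. The contract boils down to two methods; the interface name in this sketch is just for illustration:

// ElectionCallback is what the leader election component expects from us.
// The name is illustrative; the two callbacks are the real contract.
type ElectionCallback interface {
    OnElectedToBeLeader() // this node now holds the smallest znode
    OnWorker()            // another node holds the smallest znode
}

OnElectionAction implements both, which is all the election component needs to know about it.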

The State Machine

Our node can be in one of three states:

┌──────────┐  OnElectedToBeLeader()  ┌──────────┐
│   NONE   │ ────────────────────────▶│  LEADER  │
└──────────┘                          └──────────┘
      │                                  │     ▲
      │ OnWorker()            OnWorker() │     │ OnElectedToBeLeader()
      ▼                                  ▼     │
┌────────────────────────────────────────────────┐
│                     WORKER                     │
└────────────────────────────────────────────────┘

Implementation

The OnElectionAction Struct

package app

type OnElectionAction struct {
    workersRegistry      cluster.ServiceRegistry
    coordinatorsRegistry cluster.ServiceRegistry
    port                 int
    documentsDir         string
    webServer            *networking.WebServer
    currentRole          string
    mu                   sync.Mutex
}

const (
    RoleLeader = "leader"
    RoleWorker = "worker"
    RoleNone   = "none"
)

The mutex is crucial — role transitions can be triggered by ZooKeeper watches, which run in separate goroutines.
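
For completeness, here is the constructor plus the two getters the tests rely on, sketched from how they are called later in this post:

// NewOnElectionAction wires the two registries into a node that starts
// out with no role. The leader election component drives it from there.
func NewOnElectionAction(
    workersRegistry cluster.ServiceRegistry,
    coordinatorsRegistry cluster.ServiceRegistry,
    port int,
    documentsDir string,
) *OnElectionAction {
    return &OnElectionAction{
        workersRegistry:      workersRegistry,
        coordinatorsRegistry: coordinatorsRegistry,
        port:                 port,
        documentsDir:         documentsDir,
        currentRole:          RoleNone,
    }
}

// GetCurrentRole and GetWebServer are mutex-guarded accessors used in tests.
func (oea *OnElectionAction) GetCurrentRole() string {
    oea.mu.Lock()
    defer oea.mu.Unlock()
    return oea.currentRole
}

func (oea *OnElectionAction) GetWebServer() *networking.WebServer {
    oea.mu.Lock()
    defer oea.mu.Unlock()
    return oea.webServer
}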

Transitioning to Leader

When elected as leader, we need to carefully tear down the worker infrastructure and build up the coordinator:

func (oea *OnElectionAction) OnElectedToBeLeader() {
    oea.mu.Lock()
    defer oea.mu.Unlock()

    log.Printf("Transitioning to LEADER role on port %d", oea.port)

    // Step 1: Unregister from workers registry
    if err := oea.workersRegistry.UnregisterFromCluster(); err != nil {
        log.Printf("Warning: Failed to unregister from workers: %v", err)
    }

    // Step 2: Start watching for worker changes
    oea.workersRegistry.RegisterForUpdates()

    // Step 3: Stop existing worker server if running
    if oea.webServer != nil && oea.webServer.IsRunning() {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        oea.webServer.Stop(ctx)
        oea.webServer = nil
    }

    // Step 4: Create and start coordinator server
    client := networking.NewWebClient()
    coordinator := search.NewSearchCoordinator(
        oea.workersRegistry,
        client,
        oea.documentsDir,
    )
    oea.webServer = networking.NewWebServer(oea.port, coordinator)
    oea.webServer.StartAsync()

    // Step 5: Register as coordinator
    address := fmt.Sprintf("http://localhost:%d/search", oea.port)
    oea.coordinatorsRegistry.RegisterToCluster(address)

    oea.currentRole = RoleLeader
    log.Printf("Successfully transitioned to LEADER role")
}

Transitioning to Worker

The worker transition is similar but with an important optimization — if we're already a worker, we don't restart the server:

func (oea *OnElectionAction) OnWorker() {
    oea.mu.Lock()
    defer oea.mu.Unlock()

    log.Printf("Transitioning to WORKER role on port %d", oea.port)

    // Optimization: If already a worker, just ensure registration
    if oea.currentRole == RoleWorker &&
        oea.webServer != nil &&
        oea.webServer.IsRunning() {
        log.Printf("Already running as worker, ensuring registration")
        oea.ensureWorkerRegistration()
        return
    }

    // Stop any existing server
    if oea.webServer != nil && oea.webServer.IsRunning() {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        oea.webServer.Stop(ctx)
        oea.webServer = nil
    }

    // Unregister from coordinators if we were a leader
    oea.coordinatorsRegistry.UnregisterFromCluster()

    // Create and start worker server
    worker := search.NewSearchWorker()
    oea.webServer = networking.NewWebServer(oea.port, worker)
    oea.webServer.StartAsync()

    // Register as worker
    oea.ensureWorkerRegistration()

    oea.currentRole = RoleWorker
    log.Printf("Successfully transitioned to WORKER role")
}
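
OnWorker leans on a small helper, ensureWorkerRegistration, which isn't shown above. A minimal sketch, assuming the worker exposes a /task endpoint (as in the cluster walkthrough below) and that the registry treats a duplicate registration as a no-op:

// ensureWorkerRegistration registers this node's task endpoint with the
// workers registry. It relies on the registry ignoring duplicate
// registrations, so calling it on every OnWorker() is safe.
func (oea *OnElectionAction) ensureWorkerRegistration() {
    address := fmt.Sprintf("http://localhost:%d/task", oea.port)
    oea.workersRegistry.RegisterToCluster(address)
}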

Graceful Shutdown

When the application terminates, we need to clean up properly:

func (oea *OnElectionAction) Stop() error {
    oea.mu.Lock()
    defer oea.mu.Unlock()

    log.Printf("Stopping OnElectionAction")

    // Stop the web server
    if oea.webServer != nil && oea.webServer.IsRunning() {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        oea.webServer.Stop(ctx)
    }

    // Unregister from both registries
    oea.workersRegistry.UnregisterFromCluster()
    oea.coordinatorsRegistry.UnregisterFromCluster()

    oea.currentRole = RoleNone
    return nil
}

Testing Role Transitions

Testing distributed systems is notoriously difficult. We use a mock service registry to test the role transition logic in isolation:

func TestOnElectionAction_OnElectedToBeLeader(t *testing.T) {
    workersRegistry := cluster.NewMockServiceRegistry()
    coordinatorsRegistry := cluster.NewMockServiceRegistry()

    action := NewOnElectionAction(
        workersRegistry,
        coordinatorsRegistry,
        19082,
        "",
    )

    // First become a worker
    action.OnWorker()
    time.Sleep(200 * time.Millisecond)

    // Verify worker state
    if action.GetCurrentRole() != RoleWorker {
        t.Errorf("expected RoleWorker, got %s", action.GetCurrentRole())
    }

    // Now transition to leader
    action.OnElectedToBeLeader()
    time.Sleep(200 * time.Millisecond)

    // Verify leader state
    if action.GetCurrentRole() != RoleLeader {
        t.Errorf("expected RoleLeader, got %s", action.GetCurrentRole())
    }

    // Verify unregistered from workers
    if workersRegistry.IsRegistered() {
        t.Error("expected to be unregistered from workers")
    }

    // Verify registered as coordinator
    if !coordinatorsRegistry.IsRegistered() {
        t.Error("expected to be registered as coordinator")
    }

    action.Stop()
}
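
For reference, the mock is an in-memory ServiceRegistry. A condensed sketch is below; the method set and return types are inferred from the calls in this post, so treat the details as assumptions:

// MockServiceRegistry is an in-memory stand-in for the ZooKeeper-backed
// registry. It tracks whether the node is registered and counts calls to
// RegisterToCluster so tests can assert on duplicate prevention.
type MockServiceRegistry struct {
    mu            sync.Mutex
    registered    bool
    address       string
    registerCount int
}

func NewMockServiceRegistry() *MockServiceRegistry {
    return &MockServiceRegistry{}
}

func (m *MockServiceRegistry) RegisterToCluster(address string) error {
    m.mu.Lock()
    defer m.mu.Unlock()
    if m.registered {
        return nil // duplicate registration is a no-op
    }
    m.registered = true
    m.address = address
    m.registerCount++
    return nil
}

func (m *MockServiceRegistry) UnregisterFromCluster() error {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.registered = false
    return nil
}

func (m *MockServiceRegistry) RegisterForUpdates() {}

func (m *MockServiceRegistry) IsRegistered() bool {
    m.mu.Lock()
    defer m.mu.Unlock()
    return m.registered
}

func (m *MockServiceRegistry) GetRegisterCount() int {
    m.mu.Lock()
    defer m.mu.Unlock()
    return m.registerCount
}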

Testing Idempotency

An important property is that calling OnWorker() multiple times shouldn't restart the server:

func TestOnElectionAction_WorkerStartsServerOnlyOnce(t *testing.T) {
    workersRegistry := cluster.NewMockServiceRegistry()
    coordinatorsRegistry := cluster.NewMockServiceRegistry()

    action := NewOnElectionAction(
        workersRegistry,
        coordinatorsRegistry,
        19084,
        "",
    )

    // First call to OnWorker
    action.OnWorker()
    time.Sleep(200 * time.Millisecond)

    firstServer := action.GetWebServer()
    firstRegisterCount := workersRegistry.GetRegisterCount()

    // Second call to OnWorker
    action.OnWorker()
    time.Sleep(100 * time.Millisecond)

    secondServer := action.GetWebServer()

    // Server should be the same instance
    if firstServer != secondServer {
        t.Error("expected server to remain the same")
    }

    // Register count should not increase (duplicate prevention)
    if workersRegistry.GetRegisterCount() != firstRegisterCount {
        t.Error("expected no additional registrations")
    }

    action.Stop()
}

The Complete Flow

Here's what happens when a 3-node cluster starts up:

Time   Node A                      Node B               Node C
──────────────────────────────────────────────────────────────────────
t=0    Connect to ZK               Connect to ZK        Connect to ZK
t=1    Create c_001                Create c_002         Create c_003
t=2    I'm smallest!               Watch c_001          Watch c_002
       → OnElectedToBeLeader()
       → Start coordinator
       → Register /search
t=3                                → OnWorker()         → OnWorker()
                                   → Start worker       → Start worker
                                   → Register /task     → Register /task

Now if Node A crashes:

Time   Node A       Node B                      Node C
──────────────────────────────────────────────────────────────────────
t=10   💥 CRASH     Watch fires!                (still watching B)
                    c_001 deleted
t=11                Re-elect...
                    I'm smallest!
                    → OnElectedToBeLeader()
                    → Stop worker
                    → Unregister /task
                    → Start coordinator
                    → Register /search
t=12                                            Watch fires!
                                                c_002 deleted
                                                Re-elect...
                                                Watch c_002 (new B)
                                                → OnWorker() (no-op)

The cluster heals itself automatically!

Key Design Decisions

1. Mutex for Thread Safety

ZooKeeper watches fire in separate goroutines. Without the mutex, we could have race conditions during transitions.
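
A quick way to build confidence here is a test that fires both callbacks from separate goroutines and runs under go test -race. The following is a sketch rather than a test from the repository, and the port is arbitrary:

func TestOnElectionAction_ConcurrentTransitions(t *testing.T) {
    workersRegistry := cluster.NewMockServiceRegistry()
    coordinatorsRegistry := cluster.NewMockServiceRegistry()
    action := NewOnElectionAction(workersRegistry, coordinatorsRegistry, 19090, "")
    defer action.Stop()

    // Simulate two watches firing at the same time from different goroutines.
    var wg sync.WaitGroup
    wg.Add(2)
    go func() { defer wg.Done(); action.OnWorker() }()
    go func() { defer wg.Done(); action.OnElectedToBeLeader() }()
    wg.Wait()

    // The mutex serializes the transitions, so the node must end up
    // cleanly in one of the two roles rather than a half-torn-down state.
    role := action.GetCurrentRole()
    if role != RoleWorker && role != RoleLeader {
        t.Errorf("expected a definite role, got %q", role)
    }
}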

2. Graceful Server Shutdown

We use context.WithTimeout to ensure servers shut down cleanly, allowing in-flight requests to complete.
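
We haven't shown the WebServer internals; assuming it wraps net/http's http.Server (the natural fit, though the field name here is an assumption), Stop might look roughly like this:

// Stop shuts the embedded http.Server down gracefully: the listener closes
// immediately, while in-flight requests get until the context deadline
// to complete.
func (ws *WebServer) Stop(ctx context.Context) error {
    if ws.server == nil {
        return nil
    }
    return ws.server.Shutdown(ctx)
}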

3. Idempotent Worker Transition

Calling OnWorker() when already a worker is a no-op. This prevents unnecessary server restarts during re-elections.

4. Separate Registries

Workers and coordinators have separate registries. This allows the coordinator to discover workers without seeing itself.

What's Next?

In the final post, we'll wire everything together in the main application:

  • Command-line argument parsing
  • Signal handling for graceful shutdown
  • The complete startup sequence

Get the Code

git clone git@github.com:UnplugCharger/distributed_doc_search.git
cd distributed_doc_search
git checkout 06-role-transitions
go test ./internal/app/... -v

This post is part of the "Distributed Document Search" series. Follow along as we build a production-ready search cluster from scratch.