Dynamic Role Transitions: Building a Self-Healing Distributed System in Go
February 2, 2026
In the previous post, we implemented leader election and service discovery using ZooKeeper. But knowing the leader's identity is only half the battle — we also need to act on that information.
Today we'll build the OnElectionAction component that handles role transitions, turning our nodes into a self-healing cluster that automatically adapts when nodes join, leave, or fail.
The Challenge
When a node's role changes, several things need to happen:
Becoming a Leader (Coordinator):
- Stop serving as a worker
- Unregister from the workers registry
- Start watching for worker changes
- Start the coordinator server
- Register as a coordinator
Becoming a Worker:
- Stop serving as coordinator (if applicable)
- Unregister from coordinators registry
- Start the worker server
- Register as a worker
All of this needs to happen atomically and safely, even under concurrent access.
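The steps above are expressed through the ServiceRegistry abstraction from the previous post. Its full definition isn't repeated here; the following is only a minimal sketch covering the methods this post calls, with signatures inferred from the call sites, so treat it as an approximation of the real interface in the cluster package.

// ServiceRegistry, trimmed to the methods used in this post (sketch).
type ServiceRegistry interface {
	RegisterToCluster(address string) error // create our ephemeral znode holding this address
	UnregisterFromCluster() error           // delete our znode, if it exists
	RegisterForUpdates()                    // watch the registry znode for membership changes
}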
The State Machine
Our node can be in one of three states:
┌──────────┐   OnElectedToBeLeader()   ┌──────────┐
│   NONE   │ ──────────────────────────▶│  LEADER  │
└──────────┘                            └──────────┘
      │                                       │
      │ OnWorker()                            │ OnWorker()
      ▼                                       ▼
┌──────────┐   OnElectedToBeLeader()   ┌──────────┐
│  WORKER  │ ──────────────────────────▶│  LEADER  │
└──────────┘                            └──────────┘
Implementation
The OnElectionAction Struct
package app
type OnElectionAction struct {
	workersRegistry      cluster.ServiceRegistry
	coordinatorsRegistry cluster.ServiceRegistry
	port                 int
	documentsDir         string
	webServer            *networking.WebServer
	currentRole          string
	mu                   sync.Mutex
}
const (
	RoleLeader = "leader"
	RoleWorker = "worker"
	RoleNone   = "none"
)
The mutex is crucial — role transitions can be triggered by ZooKeeper watches, which run in separate goroutines.
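The tests later in the post construct the component with NewOnElectionAction, which isn't shown elsewhere. A plausible sketch, assuming it simply wires the four dependencies into the struct above:

// NewOnElectionAction wires the dependencies together; the node starts with no
// role until the first election callback fires. (Sketch: the real constructor
// may differ, but the tests below call it with exactly these four arguments.)
func NewOnElectionAction(
	workersRegistry cluster.ServiceRegistry,
	coordinatorsRegistry cluster.ServiceRegistry,
	port int,
	documentsDir string,
) *OnElectionAction {
	return &OnElectionAction{
		workersRegistry:      workersRegistry,
		coordinatorsRegistry: coordinatorsRegistry,
		port:                 port,
		documentsDir:         documentsDir,
		currentRole:          RoleNone,
	}
}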
Transitioning to Leader
When elected as leader, we need to carefully tear down the worker infrastructure and build up the coordinator:
func (oea *OnElectionAction) OnElectedToBeLeader() {
	oea.mu.Lock()
	defer oea.mu.Unlock()

	log.Printf("Transitioning to LEADER role on port %d", oea.port)

	// Step 1: Unregister from workers registry
	if err := oea.workersRegistry.UnregisterFromCluster(); err != nil {
		log.Printf("Warning: Failed to unregister from workers: %v", err)
	}

	// Step 2: Start watching for worker changes
	oea.workersRegistry.RegisterForUpdates()

	// Step 3: Stop existing worker server if running
	if oea.webServer != nil && oea.webServer.IsRunning() {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		oea.webServer.Stop(ctx)
		oea.webServer = nil
	}

	// Step 4: Create and start coordinator server
	client := networking.NewWebClient()
	coordinator := search.NewSearchCoordinator(
		oea.workersRegistry,
		client,
		oea.documentsDir,
	)
	oea.webServer = networking.NewWebServer(oea.port, coordinator)
	oea.webServer.StartAsync()

	// Step 5: Register as coordinator
	address := fmt.Sprintf("http://localhost:%d/search", oea.port)
	oea.coordinatorsRegistry.RegisterToCluster(address)

	oea.currentRole = RoleLeader
	log.Printf("Successfully transitioned to LEADER role")
}
Transitioning to Worker
The worker transition is similar but with an important optimization — if we're already a worker, we don't restart the server:
func (oea *OnElectionAction) OnWorker() {
	oea.mu.Lock()
	defer oea.mu.Unlock()

	log.Printf("Transitioning to WORKER role on port %d", oea.port)

	// Optimization: If already a worker, just ensure registration
	if oea.currentRole == RoleWorker && oea.webServer != nil && oea.webServer.IsRunning() {
		log.Printf("Already running as worker, ensuring registration")
		oea.ensureWorkerRegistration()
		return
	}

	// Stop any existing server
	if oea.webServer != nil && oea.webServer.IsRunning() {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		oea.webServer.Stop(ctx)
		oea.webServer = nil
	}

	// Unregister from coordinators if we were a leader
	oea.coordinatorsRegistry.UnregisterFromCluster()

	// Create and start worker server
	worker := search.NewSearchWorker()
	oea.webServer = networking.NewWebServer(oea.port, worker)
	oea.webServer.StartAsync()

	// Register as worker
	oea.ensureWorkerRegistration()

	oea.currentRole = RoleWorker
	log.Printf("Successfully transitioned to WORKER role")
}
Graceful Shutdown
When the application terminates, we need to clean up properly:
func (oea *OnElectionAction) Stop() error {
	oea.mu.Lock()
	defer oea.mu.Unlock()

	log.Printf("Stopping OnElectionAction")

	// Stop the web server
	if oea.webServer != nil && oea.webServer.IsRunning() {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		oea.webServer.Stop(ctx)
	}

	// Unregister from both registries
	oea.workersRegistry.UnregisterFromCluster()
	oea.coordinatorsRegistry.UnregisterFromCluster()

	oea.currentRole = RoleNone
	return nil
}
Testing Role Transitions
Testing distributed systems is notoriously difficult. We use a mock service registry to test the role transition logic in isolation:
func TestOnElectionAction_OnElectedToBeLeader(t *testing.T) {
	workersRegistry := cluster.NewMockServiceRegistry()
	coordinatorsRegistry := cluster.NewMockServiceRegistry()

	action := NewOnElectionAction(
		workersRegistry,
		coordinatorsRegistry,
		19082,
		"",
	)

	// First become a worker
	action.OnWorker()
	time.Sleep(200 * time.Millisecond)

	// Verify worker state
	if action.GetCurrentRole() != RoleWorker {
		t.Errorf("expected RoleWorker, got %s", action.GetCurrentRole())
	}

	// Now transition to leader
	action.OnElectedToBeLeader()
	time.Sleep(200 * time.Millisecond)

	// Verify leader state
	if action.GetCurrentRole() != RoleLeader {
		t.Errorf("expected RoleLeader, got %s", action.GetCurrentRole())
	}

	// Verify unregistered from workers
	if workersRegistry.IsRegistered() {
		t.Error("expected to be unregistered from workers")
	}

	// Verify registered as coordinator
	if !coordinatorsRegistry.IsRegistered() {
		t.Error("expected to be registered as coordinator")
	}

	action.Stop()
}
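The mock registry the tests construct with cluster.NewMockServiceRegistry isn't shown in this post. Conceptually it just records calls in memory instead of touching ZooKeeper; here is a rough sketch, assuming only the method names the tests rely on (everything else is illustrative):

// MockServiceRegistry records registry calls in memory (sketch).
type MockServiceRegistry struct {
	mu            sync.Mutex
	registered    bool
	registerCount int
}

func NewMockServiceRegistry() *MockServiceRegistry {
	return &MockServiceRegistry{}
}

func (m *MockServiceRegistry) RegisterToCluster(address string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.registered {
		return nil // already registered: don't bump the count (duplicate prevention)
	}
	m.registered = true
	m.registerCount++
	return nil
}

func (m *MockServiceRegistry) UnregisterFromCluster() error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.registered = false
	return nil
}

func (m *MockServiceRegistry) RegisterForUpdates() {}

func (m *MockServiceRegistry) IsRegistered() bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.registered
}

func (m *MockServiceRegistry) GetRegisterCount() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.registerCount
}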
Testing Idempotency
An important property is that calling OnWorker() multiple times shouldn't restart the server:
func TestOnElectionAction_WorkerStartsServerOnlyOnce(t *testing.T) {
	workersRegistry := cluster.NewMockServiceRegistry()
	coordinatorsRegistry := cluster.NewMockServiceRegistry()

	action := NewOnElectionAction(
		workersRegistry,
		coordinatorsRegistry,
		19084,
		"",
	)

	// First call to OnWorker
	action.OnWorker()
	time.Sleep(200 * time.Millisecond)

	firstServer := action.GetWebServer()
	firstRegisterCount := workersRegistry.GetRegisterCount()

	// Second call to OnWorker
	action.OnWorker()
	time.Sleep(100 * time.Millisecond)

	secondServer := action.GetWebServer()

	// Server should be the same instance
	if firstServer != secondServer {
		t.Error("expected server to remain the same")
	}

	// Register count should not increase (duplicate prevention)
	if workersRegistry.GetRegisterCount() != firstRegisterCount {
		t.Error("expected no additional registrations")
	}

	action.Stop()
}
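Both tests also lean on two read-only accessors, GetCurrentRole and GetWebServer, which aren't shown elsewhere. A minimal sketch; note that they take the same mutex, so a read can't interleave with a transition in progress:

func (oea *OnElectionAction) GetCurrentRole() string {
	oea.mu.Lock()
	defer oea.mu.Unlock()
	return oea.currentRole
}

func (oea *OnElectionAction) GetWebServer() *networking.WebServer {
	oea.mu.Lock()
	defer oea.mu.Unlock()
	return oea.webServer
}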
The Complete Flow
Here's what happens when a 3-node cluster starts up:
Time   Node A                     Node B               Node C
────────────────────────────────────────────────────────────────────
t=0    Connect to ZK              Connect to ZK        Connect to ZK
t=1    Create c_001               Create c_002         Create c_003
t=2    I'm smallest!              Watch c_001          Watch c_002
       → OnElectedToBeLeader()
       → Start coordinator
       → Register /search
t=3                               → OnWorker()         → OnWorker()
                                  → Start worker       → Start worker
                                  → Register /task     → Register /task
Now if Node A crashes:
Time   Node A       Node B                       Node C
────────────────────────────────────────────────────────────────────
t=10   💥 CRASH     Watch fires!                 (still watching B)
                    c_001 deleted
t=11                Re-elect... I'm smallest!
                    → OnElectedToBeLeader()
                    → Stop worker
                    → Unregister /task
                    → Start coordinator
                    → Register /search
t=12                                             Watch fires!
                                                 c_002 deleted
                                                 Re-elect...
                                                 Watch c_002 (new B)
                                                 → OnWorker() (no-op)
The cluster heals itself automatically!
Key Design Decisions
1. Mutex for Thread Safety
ZooKeeper watches fire in separate goroutines. Without the mutex, we could have race conditions during transitions.
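A cheap way to gain confidence in the locking is to run the test suite under Go's race detector, which flags unsynchronized access during the transitions:

go test -race ./internal/app/... -v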
2. Graceful Server Shutdown
We use context.WithTimeout to ensure servers shut down cleanly, allowing in-flight requests to complete.
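The WebServer type lives in the networking package and isn't reproduced in this post; its Stop method presumably wraps net/http's graceful Shutdown. A sketch under that assumption (the server field name is illustrative):

// Sketch: assumes WebServer wraps an *http.Server internally.
func (ws *WebServer) Stop(ctx context.Context) error {
	// Shutdown stops accepting new connections and waits for in-flight requests
	// to finish, giving up when ctx's deadline (5s in the callers above) expires.
	return ws.server.Shutdown(ctx)
}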
3. Idempotent Worker Transition
Calling OnWorker() when already a worker is a no-op. This prevents unnecessary server restarts during re-elections.
4. Separate Registries
Workers and coordinators have separate registries. This allows the coordinator to discover workers without seeing itself.
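In practice that just means constructing the same registry implementation twice, pointed at different parent znodes. A hypothetical wiring sketch; the constructor name, zkConn variable, and znode paths are placeholders, the real ones come from the previous post:

// Hypothetical wiring: same registry type, different parent znodes, so a
// coordinator never "discovers" itself when it lists the workers.
workersRegistry := cluster.NewServiceRegistry(zkConn, "/workers")
coordinatorsRegistry := cluster.NewServiceRegistry(zkConn, "/coordinators")
action := NewOnElectionAction(workersRegistry, coordinatorsRegistry, port, documentsDir)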
What's Next?
In the final post, we'll wire everything together in the main application:
- Command-line argument parsing
- Signal handling for graceful shutdown
- The complete startup sequence
Get the Code
git clone git@github.com:UnplugCharger/distributed_doc_search.git
cd distributed-search-cluster-go
git checkout 06-role-transitions
go test ./internal/app/... -v
This post is part of the "Distributed Document Search" series. Follow along as we build a production-ready search cluster from scratch.