Distributed Document Search in Go: Requirements Document
July 31, 2024
Introduction
This document specifies the requirements for a Go implementation of a distributed search cluster system. The system enables distributed document search using the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm across multiple worker nodes, coordinated through Apache ZooKeeper for leader election and service discovery. The system supports automatic failover, dynamic scaling of worker nodes, and efficient parallel document processing.
Glossary
- Cluster: A group of interconnected nodes working together to provide distributed search functionality
- Node: A single instance of the application that can act as either a Coordinator or Worker
- Coordinator: The elected leader node that receives search queries, distributes work to workers, and aggregates results
- Worker: A follower node that processes document search tasks and returns term frequency data
- ZooKeeper: Apache ZooKeeper, a distributed coordination service used for leader election and service registry
- Znode: A data node in ZooKeeper's hierarchical namespace
- Ephemeral_Znode: A znode that exists only while the session that created it is active
- Sequential_Znode: A znode with a monotonically increasing counter appended to its name
- Leader_Election: The process by which nodes compete to become the Coordinator
- Service_Registry: A ZooKeeper-based registry where nodes register their HTTP endpoints for discovery
- TF-IDF: Term Frequency-Inverse Document Frequency, an algorithm for measuring document relevance
- Term_Frequency: The ratio of a term's occurrences to total words in a document
- Inverse_Document_Frequency: A measure of how much information a term provides across all documents
- Task: A work unit containing search terms and document paths sent from Coordinator to Worker
- Result: The response from a Worker containing document data with term frequencies
- DocumentData: A mapping of search terms to their frequencies within a single document
- Protobuf: Protocol Buffers, a binary serialization format used for external API communication
- HTTP_Endpoint: A URL path that accepts HTTP requests for specific functionality
Requirements
Requirement 1: ZooKeeper Connection Management
User Story: As a cluster node, I want to establish and maintain a connection to ZooKeeper, so that I can participate in leader election and service discovery.
Acceptance Criteria
- WHEN the application starts, THE Node SHALL connect to ZooKeeper at a configurable address with a configurable session timeout
- WHEN the ZooKeeper connection is established, THE Node SHALL log a success message indicating connection status
- WHEN the ZooKeeper connection is lost, THE Node SHALL notify internal components and initiate graceful shutdown
- WHILE connected to ZooKeeper, THE Node SHALL maintain the session through automatic heartbeats handled by the ZooKeeper client library
- WHEN the application terminates, THE Node SHALL close the ZooKeeper connection gracefully
Requirement 2: Leader Election
User Story: As a cluster node, I want to participate in leader election, so that exactly one node becomes the Coordinator while others become Workers.
Acceptance Criteria
- WHEN a Node joins the cluster, THE Leader_Election SHALL create an ephemeral sequential znode under the /election namespace with prefix c_
- WHEN determining leadership, THE Leader_Election SHALL retrieve all children of /election, sort them lexicographically, and identify the smallest znode as the leader
- WHEN a Node's znode is the smallest, THE Leader_Election SHALL invoke the onElectedToBeLeader callback to transition the node to Coordinator role
- WHEN a Node's znode is not the smallest, THE Leader_Election SHALL invoke the onWorker callback and set a watch on its immediate predecessor znode
- WHEN a watched predecessor znode is deleted, THE Leader_Election SHALL trigger re-election to determine if the node should become the new leader
- IF the predecessor znode no longer exists during watch setup, THEN THE Leader_Election SHALL retry the election process
Requirement 3: Service Registry
User Story: As a cluster node, I want to register my HTTP endpoint in a service registry, so that other nodes can discover and communicate with me.
Acceptance Criteria
- WHEN the Service_Registry is initialized, THE Service_Registry SHALL create the registry parent znode if it does not exist
- THE Service_Registry SHALL support two separate registries: /workers_service_registry for Workers and /coordinators_service_registry for Coordinators
- WHEN a Node registers to the cluster, THE Service_Registry SHALL create an ephemeral sequential znode containing the node's HTTP endpoint address
- WHEN a Node is already registered, THE Service_Registry SHALL prevent duplicate registration and log a warning
- WHEN a Node unregisters from the cluster, THE Service_Registry SHALL delete its znode from the registry
- WHEN a Node requests all service addresses, THE Service_Registry SHALL return a list of all registered endpoint addresses
- WHEN a Node registers for updates, THE Service_Registry SHALL set a watch on the registry and update the cached address list when children change
- WHEN the registry children change, THE Service_Registry SHALL automatically refresh the cached list of service addresses
Requirement 4: Role Transition
User Story: As a cluster node, I want to transition between Coordinator and Worker roles based on election results, so that the cluster can dynamically adapt to node changes.
Acceptance Criteria
- WHEN a Node is elected as leader, THE Node SHALL unregister from the workers registry and register in the coordinators registry
- WHEN a Node is elected as leader, THE Node SHALL start an HTTP server with the /search endpoint and stop any existing worker server
- WHEN a Node becomes a Worker, THE Node SHALL register in the workers registry with its /task endpoint address
- WHEN a Node becomes a Worker, THE Node SHALL start an HTTP server with the /task endpoint if not already running
- WHEN transitioning from Worker to Coordinator, THE Node SHALL stop the worker HTTP server before starting the coordinator server
- WHEN a Node is elected as leader, THE Node SHALL register for updates on the workers registry to track available workers
Requirement 5: HTTP Server
User Story: As a cluster node, I want to run an HTTP server, so that I can receive health checks and process requests appropriate to my role.
Acceptance Criteria
- THE HTTP_Server SHALL expose a /status endpoint that responds to GET requests with a health check message
- THE HTTP_Server SHALL expose a dynamic endpoint (/search or /task) based on the node's current role
/searchor/task) based on the node's current role - WHEN a POST request is received on the role-specific endpoint, THE HTTP_Server SHALL pass the request body to the appropriate handler and return the response
- WHEN a non-POST request is received on the role-specific endpoint, THE HTTP_Server SHALL reject the request with an HTTP 405 (Method Not Allowed) status
- THE HTTP_Server SHALL use a configurable port number provided at startup
- THE HTTP_Server SHALL handle concurrent requests using goroutines
- WHEN the HTTP_Server is stopped, THE HTTP_Server SHALL gracefully shutdown allowing in-flight requests to complete
Requirement 6: HTTP Client
User Story: As a Coordinator, I want to send HTTP requests to Workers, so that I can distribute search tasks and collect results.
Acceptance Criteria
- THE HTTP_Client SHALL send POST requests with binary payloads to worker endpoints
- THE HTTP_Client SHALL support concurrent requests to multiple workers using goroutines
- WHEN sending tasks to workers, THE HTTP_Client SHALL return results through channels for aggregation
- IF a worker request fails, THEN THE HTTP_Client SHALL handle the error gracefully without crashing the Coordinator
- THE HTTP_Client SHALL deserialize worker responses into Result structures
Requirement 7: Search Coordinator
User Story: As a Coordinator, I want to handle search queries by distributing work to available Workers and aggregating results, so that users can search across all documents efficiently.
Acceptance Criteria
- WHEN a search request is received, THE Search_Coordinator SHALL parse the protobuf Request message to extract the search query
- WHEN processing a search query, THE Search_Coordinator SHALL tokenize the query into individual search terms
- WHEN distributing work, THE Search_Coordinator SHALL retrieve the list of available workers from the workers service registry
- IF no workers are available, THEN THE Search_Coordinator SHALL return an empty response
- WHEN distributing work, THE Search_Coordinator SHALL split the document list evenly among available workers
- WHEN creating tasks, THE Search_Coordinator SHALL include the search terms and assigned document paths in each Task
- THE Search_Coordinator SHALL send tasks to all workers concurrently and wait for all responses
- WHEN aggregating results, THE Search_Coordinator SHALL merge all worker results and calculate final TF-IDF scores
- WHEN returning results, THE Search_Coordinator SHALL sort documents by score in descending order
- THE Search_Coordinator SHALL serialize the response as a protobuf Response message containing per-document statistics
Requirement 8: Search Worker
User Story: As a Worker, I want to process document search tasks, so that I can calculate term frequencies for my assigned documents.
Acceptance Criteria
- WHEN a task request is received, THE Search_Worker SHALL deserialize the Task from the request body
- WHEN processing a task, THE Search_Worker SHALL read each assigned document from the filesystem
- WHEN processing a document, THE Search_Worker SHALL parse the document content into individual words
- WHEN calculating term frequencies, THE Search_Worker SHALL compute the frequency for each search term in each document
- THE Search_Worker SHALL create a Result containing DocumentData for each processed document
- THE Search_Worker SHALL serialize the Result and return it in the response body
- IF a document cannot be read, THEN THE Search_Worker SHALL skip that document and continue processing others
Requirement 9: TF-IDF Algorithm
User Story: As a search system, I want to implement the TF-IDF algorithm, so that documents can be ranked by relevance to search queries.
Acceptance Criteria
- WHEN calculating term frequency, THE TFIDF SHALL compute the ratio of term occurrences to total words in the document (case-insensitive)
- WHEN creating document data, THE TFIDF SHALL calculate term frequency for each search term and store in DocumentData
- WHEN calculating document scores, THE TFIDF SHALL compute IDF as log10(total_documents / documents_containing_term) for each term
- IF a term appears in zero documents, THEN THE TFIDF SHALL use an IDF of 0 for that term
- WHEN calculating final scores, THE TFIDF SHALL sum the product of TF and IDF for each term
- WHEN parsing text, THE TFIDF SHALL split on punctuation (periods, commas, spaces, hyphens, question marks, exclamation marks, semicolons, colons) and newlines
Requirement 10: Data Models and Serialization
User Story: As a developer, I want well-defined data models with serialization support, so that nodes can exchange data reliably.
Acceptance Criteria
- THE Task model SHALL contain a list of search terms and a list of document paths
- THE Result model SHALL contain a map of document paths to DocumentData
- THE DocumentData model SHALL contain a map of terms to their frequencies (float64)
- WHEN serializing Task and Result for internal communication, THE Serializer SHALL use JSON encoding
- WHEN deserializing Task and Result, THE Serializer SHALL reconstruct the original data structures
- FOR ALL valid Task objects, serializing then deserializing SHALL produce an equivalent object (round-trip property)
- FOR ALL valid Result objects, serializing then deserializing SHALL produce an equivalent object (round-trip property)
Requirement 11: Protocol Buffers API
User Story: As an external client, I want to communicate with the search cluster using Protocol Buffers, so that I can efficiently send queries and receive results.
Acceptance Criteria
- THE Protobuf schema SHALL define a Request message with a required search_query string field
- THE Protobuf schema SHALL define a Response message with a repeated DocumentStats field
- THE DocumentStats message SHALL contain document_name (required), score (optional), document_size (optional), and author (optional) fields
- WHEN receiving a search request, THE Search_Coordinator SHALL parse the protobuf Request message
- WHEN sending a search response, THE Search_Coordinator SHALL serialize the Response as protobuf
Requirement 12: Application Lifecycle
User Story: As an operator, I want to start and stop cluster nodes cleanly, so that the cluster remains stable during deployments.
Acceptance Criteria
- WHEN the application starts, THE Application SHALL accept a port number as a command-line argument (defaulting to 8080)
- WHEN the application starts, THE Application SHALL connect to ZooKeeper, initialize service registries, and begin leader election
- WHILE running, THE Application SHALL block the main goroutine until a shutdown signal is received
- WHEN a shutdown signal is received, THE Application SHALL unregister from service registries and close the ZooKeeper connection
- THE Application SHALL handle SIGINT and SIGTERM signals for graceful shutdown
Requirement 13: Go-Specific Implementation
User Story: As a Go developer, I want the implementation to follow Go idioms and best practices, so that the code is maintainable and performant.
Acceptance Criteria
- THE implementation SHALL use the go-zookeeper/zk library for ZooKeeper integration
- THE implementation SHALL use the standard net/http package for HTTP server and client
- THE implementation SHALL use google.golang.org/protobuf for Protocol Buffers
- THE implementation SHALL use goroutines and channels for concurrent task distribution
- THE implementation SHALL use idiomatic Go error handling with explicit error returns
- THE implementation SHALL organize code into packages following Go conventions (cluster, networking, search, model)
- THE implementation SHALL define interfaces for testability (OnElectionCallback, RequestHandler)