Building a Document Search Engine: TF-IDF from Scratch in Go
February 2, 2026
Ever wondered how search engines rank documents by relevance? In this post, we'll build the core ranking algorithm from scratch: TF-IDF (Term Frequency-Inverse Document Frequency).
By the end, you'll have a working search engine that can find "Moby Dick" when you search for "whale ocean sea captain" across a collection of 20 classic books.
What is TF-IDF?
TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection. It's the product of two metrics:
- Term Frequency (TF): How often a term appears in a document
- Inverse Document Frequency (IDF): How rare the term is across all documents
The intuition is simple: a word that appears frequently in one document but rarely in others is probably important for that document.
The Math
Term Frequency
TF(term, document) = count(term in document) / total_words(document)
This gives us a value between 0 and 1.
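To make that concrete with made-up counts: if "whale" appears 3 times in a 150-word document, then

TF("whale", document) = 3 / 150 = 0.02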
Inverse Document Frequency
IDF(term) = log₁₀(total_documents / documents_containing_term)
Rare terms get higher IDF scores. Common terms like "the" appear in every document, so their IDF approaches 0.
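With our 20-book collection, a term found in just one book gets:

IDF(term) = log₁₀(20 / 1) ≈ 1.30

while a term found in all 20 books gets log₁₀(20 / 20) = 0.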
Final Score
TF-IDF(term, document) = TF(term, document) × IDF(term)
For multiple search terms, we sum the TF-IDF scores:
Score(document) = Σ TF-IDF(term, document) for each term
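Continuing the made-up numbers: a document where "whale" has TF = 0.02 and IDF ≈ 1.30 contributes

TF-IDF("whale", document) = 0.02 × 1.30 = 0.026

to that document's score, and each remaining query term adds its own TF × IDF on top.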
Implementation in Go
Let's build this step by step.
Step 1: Text Parsing
First, we need to split text into words. We'll handle common delimiters:
```go
package search

import "strings"

const delimiters = ".,;:!?- \t\n\r"

// GetWordsFromLine splits a line into words using delimiters
func GetWordsFromLine(line string) []string {
    var words []string
    fields := strings.FieldsFunc(line, func(r rune) bool {
        return strings.ContainsRune(delimiters, r)
    })
    for _, word := range fields {
        trimmed := strings.TrimSpace(word)
        if trimmed != "" {
            words = append(words, trimmed)
        }
    }
    return words
}

// GetWordsFromDocument processes multiple lines
func GetWordsFromDocument(lines []string) []string {
    var allWords []string
    for _, line := range lines {
        allWords = append(allWords, GetWordsFromLine(line)...)
    }
    return allWords
}
```
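A quick sanity check of the tokenizer (a sketch; the output follows from the delimiter set above, with punctuation and hyphens stripped):

```go
line := "Call me Ishmael. Some years ago - never mind how long precisely"
words := GetWordsFromLine(line)
// words == [Call me Ishmael Some years ago never mind how long precisely]
```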
Step 2: Term Frequency
Calculate how often a term appears in a document (case-insensitive):
```go
func CalculateTermFrequency(words []string, term string) float64 {
    if len(words) == 0 {
        return 0.0
    }

    termLower := strings.ToLower(term)
    count := 0
    for _, word := range words {
        if strings.ToLower(word) == termLower {
            count++
        }
    }

    return float64(count) / float64(len(words))
}
```
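Here's how that plays out on a short made-up line; note that "The" and "the" both count toward the same term:

```go
words := GetWordsFromLine("The whale, the white whale!")
tf := CalculateTermFrequency(words, "whale")
// words is [The whale the white whale], so tf == 2.0/5.0 == 0.4
```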
Step 3: Document Data Model
We need a structure to store term frequencies for each document:
```go
type DocumentData struct {
    TermToFrequency map[string]float64 `json:"term_to_frequency"`
}

func NewDocumentData() *DocumentData {
    return &DocumentData{
        TermToFrequency: make(map[string]float64),
    }
}

func CreateDocumentData(words []string, terms []string) *DocumentData {
    docData := NewDocumentData()
    for _, term := range terms {
        tf := CalculateTermFrequency(words, term)
        docData.TermToFrequency[term] = tf
    }
    return docData
}
```
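Usage is straightforward; here's a sketch, where lines stands in for however you load the file:

```go
words := GetWordsFromDocument(lines)
docData := CreateDocumentData(words, []string{"whale", "ocean"})
// docData.TermToFrequency now maps "whale" and "ocean" to their TF values
```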
Step 4: IDF Calculation
This function counts how many documents contain the term (TF > 0) and applies the log formula. It needs the "math" package imported alongside "strings":

```go
func getInverseDocumentFrequency(term string, documentResults map[string]*DocumentData) float64 {
    totalDocs := len(documentResults)
    if totalDocs == 0 {
        return 0.0
    }

    docsContainingTerm := 0
    for _, docData := range documentResults {
        if docData.TermToFrequency[term] > 0 {
            docsContainingTerm++
        }
    }

    if docsContainingTerm == 0 {
        return 0.0
    }

    return math.Log10(float64(totalDocs) / float64(docsContainingTerm))
}
```
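A quick check against the hand math from earlier (hypothetical: 20 documents, with "whale" in exactly one of them):

```go
idf := getInverseDocumentFrequency("whale", documentResults)
// idf == math.Log10(20.0 / 1.0) ≈ 1.301
```

The early return when no document contains the term also keeps us from dividing by zero.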
Step 5: Putting It All Together
```go
func GetDocumentsScores(terms []string, documentResults map[string]*DocumentData) map[float64][]string {
    // Calculate IDF for each term
    termIDFs := make(map[string]float64)
    for _, term := range terms {
        termIDFs[term] = getInverseDocumentFrequency(term, documentResults)
    }

    // Calculate TF-IDF score for each document
    docScores := make(map[string]float64)
    for docPath, docData := range documentResults {
        score := 0.0
        for _, term := range terms {
            tf := docData.TermToFrequency[term]
            idf := termIDFs[term]
            score += tf * idf
        }
        docScores[docPath] = score
    }

    // Group documents by score
    scoreToDocuments := make(map[float64][]string)
    for docPath, score := range docScores {
        scoreToDocuments[score] = append(scoreToDocuments[score], docPath)
    }

    return scoreToDocuments
}
```
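Returning a map keyed by score makes grouping ties easy, but it means display code has to sort the keys itself. Here's a minimal sketch of a ranking printer (printRanked is a hypothetical helper, assuming fmt and sort are imported):

```go
func printRanked(scoreToDocuments map[float64][]string) {
    // Collect and sort scores from highest to lowest.
    scores := make([]float64, 0, len(scoreToDocuments))
    for score := range scoreToDocuments {
        scores = append(scores, score)
    }
    sort.Sort(sort.Reverse(sort.Float64Slice(scores)))

    rank := 1
    for _, score := range scores {
        for _, doc := range scoreToDocuments[score] {
            fmt.Printf("%d. [%.6f] %s\n", rank, score, doc)
            rank++
        }
    }
}
```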
Testing with Property-Based Tests
Instead of just writing example-based tests, we use property-based testing to verify that key properties of the implementation hold for arbitrary inputs. Here's an example using gopter:
```go
func TestTermFrequencyCalculation(t *testing.T) {
    properties := gopter.NewProperties(gopter.DefaultTestParameters())

    // Property: TF is always between 0 and 1
    properties.Property("TF is between 0 and 1", prop.ForAll(
        func(words []string, term string) bool {
            tf := search.CalculateTermFrequency(words, term)
            return tf >= 0.0 && tf <= 1.0
        },
        gen.SliceOf(gen.AlphaString()),
        gen.AlphaString(),
    ))

    // Property: TF is case-insensitive
    properties.Property("TF is case-insensitive", prop.ForAll(
        func(words []string, term string) bool {
            tfLower := search.CalculateTermFrequency(words, strings.ToLower(term))
            tfUpper := search.CalculateTermFrequency(words, strings.ToUpper(term))
            return tfLower == tfUpper
        },
        gen.SliceOf(gen.AlphaString()),
        gen.AlphaString(),
    ))

    properties.TestingRun(t)
}
```
This runs 100 random test cases for each property, giving us much higher confidence than a handful of hand-picked examples.
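Another property worth pinning down is the empty-input guard in CalculateTermFrequency; a sketch:

```go
// Property: TF over an empty word list is always 0.
properties.Property("TF of empty document is 0", prop.ForAll(
    func(term string) bool {
        return search.CalculateTermFrequency(nil, term) == 0.0
    },
    gen.AlphaString(),
))
```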
Demo: Searching Classic Literature
Let's test our implementation on 20 classic books:
```go
func main() {
    query := "whale ocean sea captain"
    searchTerms := search.GetWordsFromLine(query)

    // Process all books
    documentResults := make(map[string]*model.DocumentData)
    files, _ := os.ReadDir("resources/books")

    for _, file := range files {
        words := readDocumentWords(filepath.Join("resources/books", file.Name()))
        documentResults[file.Name()] = search.CreateDocumentData(words, searchTerms)
    }

    // Get ranked results
    scores := search.GetDocumentsScores(searchTerms, documentResults)
    // ... display results
}
```
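The readDocumentWords helper is elided above; a minimal sketch, assuming each book is a plain-text file read line by line (needs bufio and os imported):

```go
func readDocumentWords(path string) []string {
    f, err := os.Open(path)
    if err != nil {
        return nil
    }
    defer f.Close()

    // Collect the file's lines, then tokenize them all at once.
    var lines []string
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        lines = append(lines, scanner.Text())
    }
    return search.GetWordsFromDocument(lines)
}
```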
Results
Searching for: "whale ocean sea captain"
=== Search Results (ranked by TF-IDF score) ===
1. [0.003961] Moby Dick.txt
2. [0.000133] Frankenstein.txt
3. [0.000091] The Iliad of Homer by Homer.txt
4. [0.000084] The Count of Monte Cristo.txt
5. [0.000078] Heart of Darkness.txt
...
Moby Dick ranks #1 with a score 30x higher than the second result. The algorithm works!
Let's try another search:
Searching for: "detective murder mystery crime"
1. [0.000139] The Adventures of Sherlock Holmes.txt
2. [0.000112] The Souls of Black Folk.txt
3. [0.000093] The Strange Case Of Dr. Jekyll And Mr. Hyde.txt
4. [0.000089] Crime And Punishment.txt
Sherlock Holmes takes the top spot for detective-related queries.
Key Takeaways
- TF-IDF is elegant: Just two simple metrics combined give surprisingly good results
- Case-insensitivity matters: "Whale" and "whale" should match
- Property-based testing catches edge cases: Random inputs find bugs you'd never think to test
- The math is intuitive: Frequent + rare = important
What's Next?
This is part 1 of a series on building a distributed document search system. In upcoming posts, we'll add:
- Part 2: HTTP networking layer for distributed workers
- Part 3: Search coordinator for parallel processing
- Part 4: ZooKeeper integration for leader election
- Part 5: Full cluster deployment
Get the Code
The complete implementation is available on GitHub:
```bash
git clone git@github.com:UnplugCharger/distributed_doc_search.git
cd distributed-search-cluster-go
git checkout 01-tfidf-implementation
go run ./cmd/node/main.go whale ocean sea captain
```
This post is part of the "Distributed Document Search" series. Follow along as we build a production-ready search cluster from scratch.