Building a Document Search Engine: TF-IDF from Scratch in Go

February 2, 2026

Ever wondered how search engines rank documents by relevance? In this post, we'll build the core ranking algorithm from scratch: TF-IDF (Term Frequency-Inverse Document Frequency).

By the end, you'll have a working search engine that can find "Moby Dick" when you search for "whale ocean sea captain" across a collection of 20 classic books.

What is TF-IDF?

TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection. It's the product of two metrics:

  • Term Frequency (TF): How often a term appears in a document
  • Inverse Document Frequency (IDF): How rare the term is across all documents

The intuition is simple: a word that appears frequently in one document but rarely in others is probably important for that document.

The Math

Term Frequency

TF(term, document) = count(term in document) / total_words(document)

This gives us a value between 0 and 1.
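For example, if "whale" appears 5 times in a 100-word document, TF("whale") = 5/100 = 0.05.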

Inverse Document Frequency

IDF(term) = log₁₀(total_documents / documents_containing_term)

Rare terms get higher IDF scores. Common terms like "the" appear in every document, so their IDF approaches 0.
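For example, in our 20-book collection: if "whale" appears in only 2 books, IDF("whale") = log₁₀(20/2) = 1.0, while "the", which appears in all 20, gets IDF("the") = log₁₀(20/20) = 0.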

Final Score

TF-IDF(term, document) = TF(term, document) × IDF(term)

For multiple search terms, we sum the TF-IDF scores:

Score(document) = Σ TF-IDF(term, document) for each term
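Continuing the example above: a document where TF("whale") = 0.05 and IDF("whale") = 1.0 picks up 0.05 from that term, while "the" contributes nothing no matter how often it appears.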

Implementation in Go

Let's build this step by step.

Step 1: Text Parsing

First, we need to split text into words. We'll handle common delimiters:

package search

import "strings"

const delimiters = ".,;:!?- \t\n\r"

// GetWordsFromLine splits a line into words using delimiters
func GetWordsFromLine(line string) []string {
	var words []string
	fields := strings.FieldsFunc(line, func(r rune) bool {
		return strings.ContainsRune(delimiters, r)
	})
	for _, word := range fields {
		trimmed := strings.TrimSpace(word)
		if trimmed != "" {
			words = append(words, trimmed)
		}
	}
	return words
}

// GetWordsFromDocument processes multiple lines
func GetWordsFromDocument(lines []string) []string {
	var allWords []string
	for _, line := range lines {
		allWords = append(allWords, GetWordsFromLine(line)...)
	}
	return allWords
}
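A quick sanity check of the tokenizer (the input string is just an illustration):

words := search.GetWordsFromLine("Call me Ishmael. Some years ago...")
// words == ["Call" "me" "Ishmael" "Some" "years" "ago"]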

Step 2: Term Frequency

Calculate how often a term appears in a document (case-insensitive):

func CalculateTermFrequency(words []string, term string) float64 {
	if len(words) == 0 {
		return 0.0
	}
	termLower := strings.ToLower(term)
	count := 0
	for _, word := range words {
		if strings.ToLower(word) == termLower {
			count++
		}
	}
	return float64(count) / float64(len(words))
}
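A small illustration of the case-insensitive matching (illustrative values):

words := []string{"Whale", "whale", "ocean"}
tf := search.CalculateTermFrequency(words, "WHALE")
// tf == 2.0/3.0 ≈ 0.667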

Step 3: Document Data Model

We need a structure to store term frequencies for each document. The json struct tag isn't used yet, but it will be once these results are sent between nodes later in the series:

type DocumentData struct {
	TermToFrequency map[string]float64 `json:"term_to_frequency"`
}

func NewDocumentData() *DocumentData {
	return &DocumentData{
		TermToFrequency: make(map[string]float64),
	}
}

func CreateDocumentData(words []string, terms []string) *DocumentData {
	docData := NewDocumentData()
	for _, term := range terms {
		tf := CalculateTermFrequency(words, term)
		docData.TermToFrequency[term] = tf
	}
	return docData
}
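With the tokenizer from Step 1, indexing a document for a query becomes a two-liner (assuming lines holds the document's lines and searchTerms the query words):

words := search.GetWordsFromDocument(lines)
docData := search.CreateDocumentData(words, searchTerms)
// docData.TermToFrequency["whale"] now holds the TF of "whale" in this document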

Step 4: IDF Calculation

// getInverseDocumentFrequency computes log10(N / n), where N is the total
// number of documents and n the number of documents containing the term.
// Requires: import "math"
func getInverseDocumentFrequency(term string, documentResults map[string]*DocumentData) float64 {
	totalDocs := len(documentResults)
	if totalDocs == 0 {
		return 0.0
	}
	docsContainingTerm := 0
	for _, docData := range documentResults {
		if docData.TermToFrequency[term] > 0 {
			docsContainingTerm++
		}
	}
	// Avoid division by zero when no document contains the term
	if docsContainingTerm == 0 {
		return 0.0
	}
	return math.Log10(float64(totalDocs) / float64(docsContainingTerm))
}

Step 5: Putting It All Together

func GetDocumentsScores(terms []string, documentResults map[string]*DocumentData) map[float64][]string {
	// Calculate IDF for each term
	termIDFs := make(map[string]float64)
	for _, term := range terms {
		termIDFs[term] = getInverseDocumentFrequency(term, documentResults)
	}

	// Calculate TF-IDF score for each document
	docScores := make(map[string]float64)
	for docPath, docData := range documentResults {
		score := 0.0
		for _, term := range terms {
			tf := docData.TermToFrequency[term]
			idf := termIDFs[term]
			score += tf * idf
		}
		docScores[docPath] = score
	}

	// Group documents by score
	scoreToDocuments := make(map[float64][]string)
	for docPath, score := range docScores {
		scoreToDocuments[score] = append(scoreToDocuments[score], docPath)
	}
	return scoreToDocuments
}
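The returned map groups documents by score, but Go maps are unordered, so displaying ranked results still requires sorting the scores. Here's a minimal sketch (printRanked is a helper of our own, not part of the code above; it needs the fmt and sort packages):

// printRanked prints documents from highest to lowest TF-IDF score.
func printRanked(scoreToDocuments map[float64][]string) {
	scores := make([]float64, 0, len(scoreToDocuments))
	for score := range scoreToDocuments {
		scores = append(scores, score)
	}
	// Highest score first
	sort.Sort(sort.Reverse(sort.Float64Slice(scores)))
	rank := 1
	for _, score := range scores {
		for _, doc := range scoreToDocuments[score] {
			fmt.Printf("%d. [%.6f] %s\n", rank, score, doc)
			rank++
		}
	}
}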

Testing with Property-Based Tests

Instead of just writing example-based tests, we use property-based testing to verify our implementation holds for any input. Here's an example using gopter (github.com/leanovate/gopter):

func TestTermFrequencyCalculation(t *testing.T) {
	properties := gopter.NewProperties(gopter.DefaultTestParameters())

	// Property: TF is always between 0 and 1
	properties.Property("TF is between 0 and 1", prop.ForAll(
		func(words []string, term string) bool {
			tf := search.CalculateTermFrequency(words, term)
			return tf >= 0.0 && tf <= 1.0
		},
		gen.SliceOf(gen.AlphaString()),
		gen.AlphaString(),
	))

	// Property: TF is case-insensitive
	properties.Property("TF is case-insensitive", prop.ForAll(
		func(words []string, term string) bool {
			tfLower := search.CalculateTermFrequency(words, strings.ToLower(term))
			tfUpper := search.CalculateTermFrequency(words, strings.ToUpper(term))
			return tfLower == tfUpper
		},
		gen.SliceOf(gen.AlphaString()),
		gen.AlphaString(),
	))

	properties.TestingRun(t)
}

This runs 100 random test cases for each property, giving us much higher confidence than a handful of hand-picked examples.

Demo: Searching Classic Literature

Let's test our implementation on 20 classic books:

func main() {
	query := "whale ocean sea captain"
	searchTerms := search.GetWordsFromLine(query)

	// Process all books (error handling omitted for brevity)
	documentResults := make(map[string]*model.DocumentData)
	files, _ := os.ReadDir("resources/books")
	for _, file := range files {
		words := readDocumentWords(filepath.Join("resources/books", file.Name()))
		documentResults[file.Name()] = search.CreateDocumentData(words, searchTerms)
	}

	// Get ranked results
	scores := search.GetDocumentsScores(searchTerms, documentResults)
	// ... display results
}
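readDocumentWords isn't shown above; a minimal version, assuming each book is a plain-text file small enough to read in one call, could look like this (uses the os, log, and strings packages):

func readDocumentWords(path string) []string {
	data, err := os.ReadFile(path)
	if err != nil {
		log.Fatalf("reading %s: %v", path, err)
	}
	// Split into lines and reuse the tokenizer from Step 1
	return search.GetWordsFromDocument(strings.Split(string(data), "\n"))
}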

Results

Searching for: "whale ocean sea captain"
=== Search Results (ranked by TF-IDF score) ===
1. [0.003961] Moby Dick.txt
2. [0.000133] Frankenstein.txt
3. [0.000091] The Iliad of Homer by Homer.txt
4. [0.000084] The Count of Monte Cristo.txt
5. [0.000078] Heart of Darkness.txt
...

Moby Dick ranks #1 with a score 30x higher than the second result. The algorithm works!

Let's try another search:

Searching for: "detective murder mystery crime"
1. [0.000139] The Adventures of Sherlock Holmes.txt
2. [0.000112] The Souls of Black Folk.txt
3. [0.000093] The Strange Case Of Dr. Jekyll And Mr. Hyde.txt
4. [0.000089] Crime And Punishment.txt

Sherlock Holmes takes the top spot for detective-related queries.

Key Takeaways

  1. TF-IDF is elegant: Just two simple metrics combined give surprisingly good results
  2. Case-insensitivity matters: "Whale" and "whale" should match
  3. Property-based testing catches edge cases: Random inputs find bugs you'd never think to test
  4. The math is intuitive: Frequent in this document + rare across the collection = important

What's Next?

This is part 1 of a series on building a distributed document search system. In upcoming posts, we'll add:

  • Part 2: HTTP networking layer for distributed workers
  • Part 3: Search coordinator for parallel processing
  • Part 4: ZooKeeper integration for leader election
  • Part 5: Full cluster deployment

Get the Code

The complete implementation is available on GitHub:

git clone git@github.com:UnplugCharger/distributed_doc_search.git
cd distributed_doc_search
git checkout 01-tfidf-implementation
go run ./cmd/node/main.go whale ocean sea captain

This post is part of the "Distributed Document Search" series. Follow along as we build a production-ready search cluster from scratch.