Building a Document Search Engine: TF-IDF from Scratch in Go
February 2, 2026
Ever wondered how search engines rank documents by relevance? In this post, we'll build the core ranking algorithm from scratch: TF-IDF (Term Frequency-Inverse Document Frequency).
By the end, you'll have a working search engine that can find "Moby Dick" when you search for "whale ocean sea captain" across a collection of 20 classic books.
What is TF-IDF?
TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection. It's the product of two metrics:
- Term Frequency (TF): How often a term appears in a document
- Inverse Document Frequency (IDF): How rare the term is across all documents
The intuition is simple: a word that appears frequently in one document but rarely in others is probably important for that document.
The Math
Term Frequency
TF(term, document) = count(term in document) / total_words(document)
This gives us a value between 0 and 1.
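To make that concrete with made-up counts: if "whale" appears 3 times in a 150-word document, then

TF("whale", document) = 3 / 150 = 0.02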
Inverse Document Frequency
IDF(term) = log₁₀(total_documents / documents_containing_term)
Rare terms get higher IDF scores. Common terms like "the" appear in every document, so their IDF approaches 0.
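With our 20-book collection, a term found in just one book gets:

IDF(term) = log₁₀(20 / 1) ≈ 1.30

while a term found in all 20 books gets log₁₀(20 / 20) = 0.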
Final Score
TF-IDF(term, document) = TF(term, document) × IDF(term)
For multiple search terms, we sum the TF-IDF scores:
Score(document) = Σ TF-IDF(term, document) for each term
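Continuing the made-up numbers: a document where "whale" has TF = 0.02 and IDF ≈ 1.30 contributes

TF-IDF("whale", document) = 0.02 × 1.30 = 0.026

to that document's score, and each remaining query term adds its own TF × IDF on top.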
Implementation in Go
Let's build this step by step.
Step 1: Text Parsing
First, we need to split text into words. We'll handle common delimiters:
```go
package search

import "strings"

const delimiters = ".,;:!?- \t\n\r"

// GetWordsFromLine splits a line into words using delimiters
func GetWordsFromLine(line string) []string {
    var words []string
    fields := strings.FieldsFunc(line, func(r rune) bool {
        return strings.ContainsRune(delimiters, r)
    })
    for _, word := range fields {
        trimmed := strings.TrimSpace(word)
        if trimmed != "" {
            words = append(words, trimmed)
        }
    }
    return words
}

// GetWordsFromDocument processes multiple lines
func GetWordsFromDocument(lines []string) []string {
    var allWords []string
    for _, line := range lines {
        allWords = append(allWords, GetWordsFromLine(line)...)
    }
    return allWords
}
```
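A quick sanity check of the tokenizer (a sketch; the output follows from the delimiter set above, with punctuation and hyphens stripped):

```go
line := "Call me Ishmael. Some years ago - never mind how long precisely"
words := GetWordsFromLine(line)
// words == [Call me Ishmael Some years ago never mind how long precisely]
```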
Step 2: Term Frequency
Calculate how often a term appears in a document (case-insensitive):
```go
func CalculateTermFrequency(words []string, term string) float64 {
    if len(words) == 0 {
        return 0.0
    }

    termLower := strings.ToLower(term)
    count := 0
    for _, word := range words {
        if strings.ToLower(word) == termLower {
            count++
        }
    }

    return float64(count) / float64(len(words))
}
```
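Here's how that plays out on a short made-up line; note that "The" and "the" both count toward the same term:

```go
words := GetWordsFromLine("The whale, the white whale!")
tf := CalculateTermFrequency(words, "whale")
// words is [The whale the white whale], so tf == 2.0/5.0 == 0.4
```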
Step 3: Document Data Model
We need a structure to store term frequencies for each document:
```go
type DocumentData struct {
    TermToFrequency map[string]float64 `json:"term_to_frequency"`
}

func NewDocumentData() *DocumentData {
    return &DocumentData{
        TermToFrequency: make(map[string]float64),
    }
}

func CreateDocumentData(words []string, terms []string) *DocumentData {
    docData := NewDocumentData()
    for _, term := range terms {
        tf := CalculateTermFrequency(words, term)
        docData.TermToFrequency[term] = tf
    }
    return docData
}
```
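Usage is straightforward; here's a sketch, where lines stands in for however you load the file:

```go
words := GetWordsFromDocument(lines)
docData := CreateDocumentData(words, []string{"whale", "ocean"})
// docData.TermToFrequency now maps "whale" and "ocean" to their TF values
```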
Step 4: IDF Calculation
This function counts how many documents contain the term (TF > 0) and applies the log formula. It needs the "math" package imported alongside "strings":

```go
func getInverseDocumentFrequency(term string, documentResults map[string]*DocumentData) float64 {
    totalDocs := len(documentResults)
    if totalDocs == 0 {
        return 0.0
    }

    docsContainingTerm := 0
    for _, docData := range documentResults {
        if docData.TermToFrequency[term] > 0 {
            docsContainingTerm++
        }
    }

    if docsContainingTerm == 0 {
        return 0.0
    }

    return math.Log10(float64(totalDocs) / float64(docsContainingTerm))
}
```
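A quick check against the hand math from earlier (hypothetical: 20 documents, with "whale" in exactly one of them):

```go
idf := getInverseDocumentFrequency("whale", documentResults)
// idf == math.Log10(20.0 / 1.0) ≈ 1.301
```

The early return when no document contains the term also keeps us from dividing by zero.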
Step 5: Putting It All Together
```go
func GetDocumentsScores(terms []string, documentResults map[string]*DocumentData) map[float64][]string {
    // Calculate IDF for each term
    termIDFs := make(map[string]float64)
    for _, term := range terms {
        termIDFs[term] = getInverseDocumentFrequency(term, documentResults)
    }

    // Calculate TF-IDF score for each document
    docScores := make(map[string]float64)
    for docPath, docData := range documentResults {
        score := 0.0
        for _, term := range terms {
            tf := docData.TermToFrequency[term]
            idf := termIDFs[term]
            score += tf * idf
        }
        docScores[docPath] = score
    }

    // Group documents by score
    scoreToDocuments := make(map[float64][]string)
    for docPath, score := range docScores {
        scoreToDocuments[score] = append(scoreToDocuments[score], docPath)
    }

    return scoreToDocuments
}
```
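Returning a map keyed by score makes grouping ties easy, but it means display code has to sort the keys itself. Here's a minimal sketch of a ranking printer (printRanked is a hypothetical helper, assuming fmt and sort are imported):

```go
func printRanked(scoreToDocuments map[float64][]string) {
    // Collect and sort scores from highest to lowest.
    scores := make([]float64, 0, len(scoreToDocuments))
    for score := range scoreToDocuments {
        scores = append(scores, score)
    }
    sort.Sort(sort.Reverse(sort.Float64Slice(scores)))

    rank := 1
    for _, score := range scores {
        for _, doc := range scoreToDocuments[score] {
            fmt.Printf("%d. [%.6f] %s\n", rank, score, doc)
            rank++
        }
    }
}
```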
Testing with Property-Based Tests
Instead of just writing example-based tests, we use property-based testing to verify that key properties of the implementation hold for arbitrary inputs. Here's an example using gopter:
```go
func TestTermFrequencyCalculation(t *testing.T) {
    properties := gopter.NewProperties(gopter.DefaultTestParameters())

    // Property: TF is always between 0 and 1
    properties.Property("TF is between 0 and 1", prop.ForAll(
        func(words []string, term string) bool {
            tf := search.CalculateTermFrequency(words, term)
            return tf >= 0.0 && tf <= 1.0
        },
        gen.SliceOf(gen.AlphaString()),
        gen.AlphaString(),
    ))

    // Property: TF is case-insensitive
    properties.Property("TF is case-insensitive", prop.ForAll(
        func(words []string, term string) bool {
            tfLower := search.CalculateTermFrequency(words, strings.ToLower(term))
            tfUpper := search.CalculateTermFrequency(words, strings.ToUpper(term))
            return tfLower == tfUpper
        },
        gen.SliceOf(gen.AlphaString()),
        gen.AlphaString(),
    ))

    properties.TestingRun(t)
}
```
This runs 100 random test cases for each property, giving us much higher confidence than a handful of hand-picked examples.
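Another property worth pinning down is the empty-input guard in CalculateTermFrequency; a sketch:

```go
// Property: TF over an empty word list is always 0.
properties.Property("TF of empty document is 0", prop.ForAll(
    func(term string) bool {
        return search.CalculateTermFrequency(nil, term) == 0.0
    },
    gen.AlphaString(),
))
```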
Demo: Searching Classic Literature
Let's test our implementation on 20 classic books:
```go
func main() {
    query := "whale ocean sea captain"
    searchTerms := search.GetWordsFromLine(query)

    // Process all books
    documentResults := make(map[string]*model.DocumentData)
    files, _ := os.ReadDir("resources/books")

    for _, file := range files {
        words := readDocumentWords(filepath.Join("resources/books", file.Name()))
        documentResults[file.Name()] = search.CreateDocumentData(words, searchTerms)
    }

    // Get ranked results
    scores := search.GetDocumentsScores(searchTerms, documentResults)
    // ... display results
}
```
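The readDocumentWords helper is elided above; a minimal sketch, assuming each book is a plain-text file read line by line (needs bufio and os imported):

```go
func readDocumentWords(path string) []string {
    f, err := os.Open(path)
    if err != nil {
        return nil
    }
    defer f.Close()

    // Collect the file's lines, then tokenize them all at once.
    var lines []string
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        lines = append(lines, scanner.Text())
    }
    return search.GetWordsFromDocument(lines)
}
```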
Results
Searching for: "whale ocean sea captain"
=== Search Results (ranked by TF-IDF score) ===
1. [0.003961] Moby Dick.txt
2. [0.000133] Frankenstein.txt
3. [0.000091] The Iliad of Homer by Homer.txt
4. [0.000084] The Count of Monte Cristo.txt
5. [0.000078] Heart of Darkness.txt
...
Moby Dick ranks #1 with a score 30x higher than the second result. The algorithm works!
Let's try another search:
Searching for: "detective murder mystery crime"
1. [0.000139] The Adventures of Sherlock Holmes.txt
2. [0.000112] The Souls of Black Folk.txt
3. [0.000093] The Strange Case Of Dr. Jekyll And Mr. Hyde.txt
4. [0.000089] Crime And Punishment.txt
Sherlock Holmes takes the top spot for detective-related queries.
Key Takeaways
- TF-IDF is elegant: Just two simple metrics combined give surprisingly good results
- Case-insensitivity matters: "Whale" and "whale" should match
- Property-based testing catches edge cases: Random inputs find bugs you'd never think to test
- The math is intuitive: Frequent + rare = important
What's Next?
This is part 1 of a series on building a distributed document search system. In upcoming posts, we'll add:
- Part 2: HTTP networking layer for distributed workers
- Part 3: Search coordinator for parallel processing
- Part 4: ZooKeeper integration for leader election
- Part 5: Full cluster deployment
Get the Code
The complete implementation is available on GitHub:
```bash
git clone git@github.com:UnplugCharger/distributed_doc_search.git
cd distributed-search-cluster-go
git checkout 01-tfidf-implementation
go run ./cmd/node/main.go whale ocean sea captain
```
This post is part of the "Distributed Document Search" series. Follow along as we build a production-ready search cluster from scratch.