How Diff Algorithms Work: The Science of Comparing Text
Every time you review a Pull Request on GitHub, track changes in Word, or use any text comparison tool, a sophisticated diff algorithm is working behind the scenes. How does it find all the differences between two documents in mere milliseconds? Let's dive in.
What Is Diff?
Diff (difference comparison) is a technique for comparing two texts and identifying their differences. Its goal is to find the Minimum Edit Distance — the fewest operations (additions, deletions, modifications) needed to transform the original document into the new one.
A good diff algorithm doesn't just find differences — it produces a human-readable, intuitive difference report. The same two documents could yield multiple valid diff results; the algorithm aims to find the most meaningful one.
Longest Common Subsequence (LCS)
The core concept behind diff algorithms is the Longest Common Subsequence (LCS). Simply put, the LCS is the longest portion that appears in both documents in the same order.
Once the LCS is found, everything not in the LCS represents differences — elements in the original but not in the LCS are "deletions," and elements in the new document but not in the LCS are "additions."
Key Takeaway: LCS is the mathematical foundation of diff algorithms. Finding the longest common subsequence is equivalent to finding the minimal set of differences.
Myers Diff Algorithm
In 1986, Eugene W. Myers published a groundbreaking paper "An O(ND) Difference Algorithm and Its Variations," proposing what is now the most widely used diff algorithm.
Core Idea
The brilliance of Myers' algorithm lies in transforming the diff problem into a shortest-path problem in graph theory. It searches for the shortest path from the top-left to bottom-right corner of an Edit Graph:
- Moving right — Represents deleting a line from the original
- Moving down — Represents adding a line from the new document
- Diagonal movement — Represents matching lines (no change needed)
The algorithm's goal is to find the path that maximizes diagonal movements (minimizing additions and deletions).
Time Complexity
Myers' algorithm has a time complexity of O(ND), where N is the total length of both documents and D is the number of differences. When differences are few (the most common case), the algorithm is extremely efficient.
| Algorithm | Time Complexity | Characteristics |
|---|---|---|
| Brute Force | O(2^N) | Enumerates all possibilities, impractical |
| Dynamic Programming LCS | O(N*M) | Classic approach, high memory usage |
| Myers Diff | O(ND) | Extremely fast when differences are few |
| Patience Diff | O(N log N) | More human-friendly results |
Hunt-McIlroy Algorithm
Before Myers, Hunt and McIlroy proposed the original Unix diff algorithm in 1976. Also based on LCS, it uses a different search strategy — first finding all matching lines, then selecting the longest common subsequence from them.
The Hunt-McIlroy algorithm has O(N log N) complexity where N is the number of matching lines. It's efficient when matches are few but may be slower than Myers when matches are abundant.
Patience Diff
Patience Diff is a variant of Myers that uses a "patience sorting" strategy. Its distinguishing feature is producing diff results that are more readable for humans. Git allows you to use git diff --patience to enable this algorithm.
The core idea is to first identify unique lines in both documents, use them as anchor points, then perform fine-grained comparison between anchors. This typically produces results that better match human intuition.
Line-Level vs Character-Level Comparison
Most diff tools default to line-level comparison, but sometimes you need to know exactly which characters changed within a line. Character-level diff pinpoints every character change.
- Line-level diff — Ideal for code, config files, and structured text
- Character-level diff — Ideal for articles, contracts, and text requiring precise tracking
- Word-level diff — A middle ground, ideal for natural language text
Real-World Applications
Diff algorithms are ubiquitous in modern software development:
- Git — Core functionality of version control systems
- Code Review — PR diff views on GitHub and GitLab
- Document Editing — Track Changes in Word and Google Docs
- Databases — Comparing schema changes
Our text diff tool is built on these algorithms, letting you quickly find differences between any two pieces of text.
Try the Text Diff Tool Now →Conclusion
Diff algorithms may seem simple on the surface, but they involve deep computer science principles. From Hunt-McIlroy in 1976 to Myers in 1986 to modern variants, this field continues to evolve. Next time you see a clean diff report, you'll appreciate the elegant mathematics behind it.
References
- Myers, Eugene W. "An O(ND) Difference Algorithm and Its Variations." Algorithmica, 1(2), 1986, pp. 251-266. https://doi.org/10.1007/BF01840446
- Hunt, James W. and McIlroy, M. Douglas. "An Algorithm for Differential File Comparison." Bell Laboratories Computing Science Technical Report, No. 41, 1976.
- GNU Project. "GNU Diffutils Manual." Free Software Foundation, 2023. https://www.gnu.org/software/diffutils/manual/
- Bram Cohen. "Patience Diff Advantages." Codeville Blog, 2005. https://bramcohen.livejournal.com/73318.html