
An Introduction to Submodularity

· 15 min read
Series parts
  1. Part 1: An Introduction to Submodularity
  2. Part 2: The Greedy Algorithm for Submodular Maximization
  3. Part 3: Stochastic Greedy: Scaling Submodular Maximization to Massive Datasets

Imagine you are hiring engineers for a new team. You have a budget for $k$ hires, and each candidate brings a mix of skills: languages, frameworks, domain expertise. The first person you hire is enormously valuable: they fill every gap, because the team has no one yet. The second hire is still great, but some of their skills overlap with person number one. By the time you are picking the tenth engineer, the team already covers most of what you need, and each additional hire moves the needle less.

This pattern, where each new addition helps but helps less than the previous one, is called diminishing returns. It shows up constantly in engineering and optimization: adding servers to a load-balanced cluster, selecting test cases for a test suite, picking features for a machine learning model, choosing seed users for a viral marketing campaign. The pattern is so pervasive that mathematicians gave it a name and built an entire theory around it.

That name is submodularity.

Set Functions: The Building Block

Before I can define submodularity precisely, I need one prerequisite: the notion of a set function.

A set function is a function whose input is a set rather than a number. You start with a finite collection of items called the ground set, written as $\mathcal{V} = \{1, 2, \dots, n\}$. These items can be anything: job candidates, sensor locations, advertising channels, data points. A set function $f$ takes any subset $S \subseteq \mathcal{V}$ and maps it to a real number. Formally, $f : 2^{\mathcal{V}} \rightarrow \mathbb{R}$.[1]

Two concrete examples:

  • $f(S)$ = the number of distinct programming languages known by the team members in $S$.
  • $f(S)$ = the total geographic area covered by sensors placed at the locations in $S$.

Both of these functions take a set as input and produce a single number. That is all a set function is.

Diminishing Returns, Formalized

Back to the hiring example. Suppose you have four candidates:

| Candidate | Languages                |
| --------- | ------------------------ |
| Alice     | Python, Java, TypeScript |
| Bob       | Python, Go               |
| Carol     | Java, Rust               |
| David     | Python, Java, Go         |
Define $f(S)$ as the number of distinct languages covered by the candidates in $S$. The marginal gain of adding a person $e$ to an existing team $S$ measures how much $f$ increases:

$$f(e \mid S) \coloneqq f(S \cup \{e\}) - f(S)$$

Toggle candidates on and off in the interactive demo below to explore how $f(S)$ and marginal gains change as the team grows:

*Interactive demo: Marginal Gain Explorer. The left panel shows who is selected; the right shows how many new skills each remaining candidate would add. Notice how marginal gains shrink as the team grows: that is diminishing returns in action.*

Consider David’s marginal gain across three scenarios:

  • Adding David to $\varnothing$: the team goes from nothing to {Python, Java, Go}. Marginal gain $= 3$.
  • Adding David to $\{\text{Alice}\}$: the team already covers {Python, Java, TypeScript}. David adds Go, but Python and Java overlap. Marginal gain $= 1$.
  • Adding David to $\{\text{Alice}, \text{Bob}\}$: the team already covers {Python, Java, TypeScript, Go}. Every language David knows is redundant. Marginal gain $= 0$.

David’s value drops from $3$ to $1$ to $0$ as the team grows, even though his skills have not changed. The context grew around him, making his contribution increasingly redundant. Now consider Bob, who is the only person who knows Go:

  • Adding Bob to $\varnothing$: marginal gain $= 2$.
  • Adding Bob to $\{\text{Alice}\}$: marginal gain $= 1$ (Go is new; Python overlaps).
  • Adding Bob to $\{\text{Alice}, \text{Carol}, \text{David}\}$: marginal gain $= 0$ (David already brought Go; Python was covered by Alice).

Bob’s marginal gain also decreases, but it drops to zero only once David is on the team, because David is the other person who knows Go.

The pattern is consistent: marginal gains can only stay the same or decrease as the context set grows. This is the diminishing returns property in action.

A function $f$ is submodular when this pattern holds systematically: for any element $e \in \mathcal{V} \setminus B$ and any two sets $A \subseteq B \subseteq \mathcal{V}$, adding $e$ to the smaller set $A$ gives at least as much gain as adding $e$ to the larger set $B$:

$$f(A \cup \{e\}) - f(A) \geq f(B \cup \{e\}) - f(B)$$

In plain language: the marginal gain of any element can only stay the same or decrease as the context set grows. Bigger context, smaller (or equal) marginal contribution.

Set Functions in Code

The concepts above translate directly into TypeScript. A set function is any function that takes a set and returns a number:

set-function.ts

```ts
/**
 * A set function maps subsets of a ground set to real numbers.
 */
type SetFunction<E> = (s: ReadonlySet<E>) => number

/**
 * Marginal gain of adding element e to set S:
 * f(e | S) = f(S ∪ {e}) − f(S).
 */
function marginalGain<E>(
  f: SetFunction<E>,
  s: ReadonlySet<E>,
  e: E,
): number {
  const withE = new Set(s)
  withE.add(e)
  return f(withE) - f(s)
}
```

The SetFunction type is generic over the element type E; it works whether your ground set contains strings, numbers, or any other type. The marginalGain function is a direct translation of the formula $f(e \mid S) = f(S \cup \{e\}) - f(S)$.

The language-coverage function from the hiring example is straightforward to implement:

language-coverage.ts

```ts
// Ground set: four candidates, each covering a set of programming languages.
const skills: Record<string, ReadonlySet<string>> = {
  Alice: new Set(["Python", "Java", "TypeScript"]),
  Bob: new Set(["Python", "Go"]),
  Carol: new Set(["Java", "Rust"]),
  David: new Set(["Python", "Java", "Go"]),
}

// f(S) = number of distinct languages covered by the candidates in S.
const languageCoverage: SetFunction<string> = (s) => {
  const covered = new Set<string>()
  for (const candidate of s) {
    const langs = skills[candidate]
    if (langs) for (const l of langs) covered.add(l)
  }
  return covered.size
}

// Try it:
languageCoverage(new Set(["Alice"]))                 // → 3 (Python, Java, TypeScript)
languageCoverage(new Set(["Alice", "Bob"]))          // → 4 (+ Go)
languageCoverage(new Set(["Alice", "Bob", "Carol"])) // → 5 (+ Rust)

// Marginal gain of David given {Alice}:
marginalGain(languageCoverage, new Set(["Alice"]), "David") // → 1 (Go is new)
```

Notice how the function itself is the thing that exhibits diminishing returns; the code does not enforce submodularity. It just evaluates the set function honestly. Whether a particular SetFunction is submodular depends on the problem structure, not on the implementation.
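Because the code does not enforce submodularity, it can be useful to test for it. For small ground sets, the diminishing-returns definition can be checked exhaustively. A sketch under that assumption (the `subsets` and `isSubmodularDR` helpers are mine, not from any library, and the check is exponential in the ground set size, so it is only a sanity tool):

```typescript
// Repeated here so the snippet stands alone; identical to set-function.ts above.
type SetFunction<E> = (s: ReadonlySet<E>) => number

// Enumerate all 2^n subsets of a small ground set via bitmasks.
function subsets<E>(ground: E[]): Set<E>[] {
  const out: Set<E>[] = []
  for (let mask = 0; mask < 1 << ground.length; mask++) {
    const s = new Set<E>()
    ground.forEach((e, i) => {
      if (mask & (1 << i)) s.add(e)
    })
    out.push(s)
  }
  return out
}

// Diminishing returns: f(A ∪ {e}) − f(A) ≥ f(B ∪ {e}) − f(B)
// for every A ⊆ B and every e ∉ B.
function isSubmodularDR<E>(f: SetFunction<E>, ground: E[]): boolean {
  const all = subsets(ground)
  for (const a of all) {
    for (const b of all) {
      if (![...a].every((x) => b.has(x))) continue // require A ⊆ B
      for (const e of ground) {
        if (b.has(e)) continue
        const gainA = f(new Set([...a, e])) - f(a)
        const gainB = f(new Set([...b, e])) - f(b)
        if (gainA < gainB - 1e-9) return false // tolerance for float noise
      }
    }
  }
  return true
}
```

Running this on languageCoverage over the four candidates returns true; running it on a supermodular function such as $f(S) = |S|^2$ returns false, because its marginal gains grow with context.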

This SetFunction / marginalGain pair is all the machinery we need. In Part 2, we will build the greedy algorithm on top of these two primitives and prove that it achieves a $(1 - 1/e)$ approximation guarantee.

Two Equivalent Definitions

The diminishing returns formulation is the most intuitive one, but there is an equivalent definition that you will encounter if you read any paper on the topic:

$$f(A) + f(B) \geq f(A \cup B) + f(A \cap B) \quad \text{for all } A, B \subseteq \mathcal{V}$$

The intuition: when you take the union of $A$ and $B$, the overlapping elements get “counted” less efficiently than if you evaluated $A$ and $B$ separately. The union “wastes” some value because of redundancy; the intersection is lean.

These two definitions, diminishing returns and the union/intersection inequality, are mathematically equivalent for set functions. I will not prove this here (the proof is a short induction argument[2]), but both forms matter in practice. The diminishing returns version is better for building intuition; the union/intersection version is often easier to work with in proofs.
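The union/intersection form is just as mechanical to check on small ground sets. A sketch mirroring the brute-force approach (`isSubmodularUI` is my own helper name, not a library function):

```typescript
// Repeated here so the snippet stands alone; identical to set-function.ts above.
type SetFunction<E> = (s: ReadonlySet<E>) => number

function union<E>(a: ReadonlySet<E>, b: ReadonlySet<E>): Set<E> {
  return new Set([...a, ...b])
}

function intersection<E>(a: ReadonlySet<E>, b: ReadonlySet<E>): Set<E> {
  return new Set([...a].filter((x) => b.has(x)))
}

// Check f(A) + f(B) ≥ f(A ∪ B) + f(A ∩ B) for every pair of subsets A, B.
function isSubmodularUI<E>(f: SetFunction<E>, ground: E[]): boolean {
  const all: Set<E>[] = []
  for (let mask = 0; mask < 1 << ground.length; mask++) {
    const s = new Set<E>()
    ground.forEach((e, i) => {
      if (mask & (1 << i)) s.add(e)
    })
    all.push(s)
  }
  for (const a of all) {
    for (const b of all) {
      if (f(a) + f(b) < f(union(a, b)) + f(intersection(a, b)) - 1e-9) {
        return false
      }
    }
  }
  return true
}
```

Since the two definitions are equivalent, this checker must agree with the diminishing-returns checker on every set function, which is itself a handy cross-check of both implementations.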

One important point: submodularity is a property of the function $f$, not of any specific algorithm or data structure. It describes the shape of how value accumulates when you add elements to a set.

Submodularity Verifier

*Interactive demo: is a given set function submodular? Pick a preset, define a coverage matrix, or enter values directly. The verifier checks every pair of sets against both definitions, diminishing returns and the union/intersection inequality, and shows a witness or counterexample.*

Monotonicity and Modularity

Two related properties come up alongside submodularity often enough that they are worth defining upfront.

Monotone functions. A set function $f$ is monotone if adding elements never hurts: whenever $A \subseteq B$, it holds that $f(A) \leq f(B)$. The language-coverage function from the hiring example is monotone: more people on the team means the same or more languages covered; you never lose coverage by hiring someone. However, not every submodular function is monotone. A classic counterexample is the graph cut function[3], where $f(S)$ counts the edges between the nodes in $S$ and the nodes outside $S$. Selecting all nodes gives a cut of zero, which is worse than a well-chosen subset.

Modular functions. A function is modular if the marginal gain of every element is the same regardless of context: no diminishing returns, no increasing returns. This is the discrete analogue of a linear function:

$$f(S) = \sum_{e \in S} w(e)$$

where each element $e$ has a fixed weight $w(e)$. In the hiring analogy, a modular function would mean every candidate brings skills that nobody else has. In reality, skills overlap, which is exactly why most coverage-type objectives are submodular rather than modular.
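A modular objective is a one-liner in the SetFunction framework. A sketch with hypothetical per-candidate weights (the numbers are made up for illustration):

```typescript
// Repeated here so the snippet stands alone; identical to set-function.ts above.
type SetFunction<E> = (s: ReadonlySet<E>) => number

// Hypothetical fixed weights: each candidate contributes w(e) regardless of
// who else is already on the team.
const weights: Record<string, number> = { Alice: 3, Bob: 2, Carol: 2, David: 3 }

// Modular: f(S) = Σ_{e ∈ S} w(e). The marginal gain of e is always w(e).
const modularValue: SetFunction<string> = (s) => {
  let total = 0
  for (const e of s) total += weights[e] ?? 0
  return total
}

modularValue(new Set(["Alice", "Bob"])) // → 5
```

Compare with languageCoverage, which gives 4 for the same pair: the gap of one is exactly the overlap (Python) that makes coverage submodular rather than modular.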

For completeness: supermodular functions are the mirror image of submodular ones, where marginal gains increase with set size. They model cooperation effects: the value of adding a new member grows when more people are already present. Formally, $f$ is supermodular if and only if $-f$ is submodular. Supermodularity is relevant in game theory and social choice, but this blog series focuses on the submodular side.

Where Submodularity Shows Up

The language-coverage example is deliberately simple. The concept becomes genuinely useful when you realize how many real problems exhibit this structure.

Sensor placement. You want to place $k$ sensors across a city to maximize the total area monitored. The first sensor covers a fresh region. The fifth sensor’s coverage inevitably overlaps with the sensors already deployed. The objective function, total monitored area as a function of the selected sensor locations, is monotone submodular[4].

Document summarization. Given a large corpus of text, you want to select $k$ sentences that best summarize the content. Each additional sentence contributes less novel information as the summary grows. Lin and Bilmes (2011)[5] formalized this as a submodular maximization problem and obtained results that are competitive with much heavier NLP pipelines.

Feature selection in machine learning. When selecting $k$ features from a dataset to train a model, each new feature adds less predictive power if it is correlated with features already chosen. Information-gain-based objectives are submodular[6].

Influence maximization in social networks. Pick $k$ seed users to maximize the spread of a message through a network. The more influencers you already have, the less incremental reach a new seed user provides. Kempe, Kleinberg, and Tardos (2003)[7] showed this is a submodular maximization problem, and the result spawned an entire research direction.

Facility location. Where should you place $k$ warehouses, hospitals, or CDN edge servers to maximize service coverage? Given a matrix $M \in \mathbb{R}^{m \times n}$ where $M_{ij}$ quantifies the service value that facility $j$ provides to customer $i$, the total service value is:

$$f(S) \coloneqq \sum_{i=1}^{m} \max_{j \in S} M_{ij}$$

Each new facility serves the population closest to it, but as you add more, the marginal improvement in total service quality diminishes. This function is monotone submodular when $M_{ij} \geq 0$ for all $i, j$[8].
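Here is a minimal sketch of this objective in the SetFunction framework; the 3×3 matrix is a made-up example, not data from any real deployment:

```typescript
// Repeated here so the snippet stands alone; identical to set-function.ts above.
type SetFunction<E> = (s: ReadonlySet<E>) => number

// M[i][j] ≥ 0: service value facility j provides to customer i (toy numbers).
const M: number[][] = [
  [5, 1, 0],
  [2, 4, 3],
  [0, 2, 6],
]

// f(S) = Σ_i max_{j ∈ S} M[i][j], with the convention f(∅) = 0.
const facilityLocation: SetFunction<number> = (s) => {
  let total = 0
  for (const row of M) {
    let best = 0 // a valid floor because every M[i][j] is nonnegative
    for (const j of s) best = Math.max(best, row[j])
    total += best
  }
  return total
}

facilityLocation(new Set([0]))    // → 7  (every customer served by facility 0)
facilityLocation(new Set([0, 2])) // → 14 (each customer picks the better facility)
```

The `max` is what creates the diminishing returns: a new facility only helps the customers for whom it beats every facility already selected.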

Optimal budget allocation. You have a fixed advertising budget to distribute among $k$ channels. Each dollar spent on a channel influences potential customers, but with diminishing effectiveness as more budget is allocated[9]. This was a motivating problem for my own thesis work, and I will come back to it in Part 3.

The common thread: these are all combinatorial selection problems where the objective exhibits diminishing returns. Economists have studied diminishing returns for centuries[10]. Submodularity is the combinatorial optimization community’s formalization of the same idea.

Submodularity as Discrete Concavity

If you remember concave functions from calculus, there is a clean analogy. A concave function $g(x)$ on the real line has a decreasing derivative: the slope flattens as $x$ grows. A submodular function behaves the same way, but in the discrete world: the “derivative” (marginal gain) decreases as the set grows.

Concavity Bridge

*Interactive demo: the left panel shows a continuous curve $g(x)$; the right shows the discrete marginal gains $\Delta g(i) = g(i) - g(i-1)$. Drag the point along the curve and watch how the tangent slope mirrors the height of the corresponding marginal gain bar. When $f(S) = g(|S|)$ depends only on set size, $f$ is submodular if and only if $g$ is concave.*

This analogy is more than a metaphor. There is a formal result: if you define a set function $f(S) \coloneqq g(|S|)$ that only depends on the cardinality of $S$, then $f$ is submodular if and only if $g$ is concave[11]. The discrete notion and the continuous notion align perfectly in this special case.
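A quick sketch of this special case, using the concave function $g(x) = \sqrt{x}$:

```typescript
// Repeated here so the snippet stands alone; identical to set-function.ts above.
type SetFunction<E> = (s: ReadonlySet<E>) => number

// g is concave on [0, ∞), so f(S) = g(|S|) is submodular.
const g = (x: number): number => Math.sqrt(x)
const bySize: SetFunction<number> = (s) => g(s.size)

// The marginal gain of any element given a set of size k is g(k + 1) − g(k),
// which strictly decreases: 1, ≈0.414, ≈0.318, ≈0.268, …
const gains: number[] = [0, 1, 2, 3].map((k) => g(k + 1) - g(k))
```

Swapping in a convex $g$, say $g(x) = x^2$, flips the inequality and yields a supermodular $f$ instead.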

However, the full picture is more nuanced than “submodularity == concavity.” Submodular functions relate to both convexity and concavity, depending on which problem you are solving. Minimizing a submodular function (without constraints) can be done in polynomial time, analogously to minimizing a convex function. There is even an elegant continuous extension called the Lovász extension[12], $\hat{f} : [0,1]^{\mathcal{V}} \rightarrow \mathbb{R}$, that makes this precise. When $f$ is submodular, $\hat{f}$ is convex, and minimizing $f$ over subsets of $\mathcal{V}$ reduces to minimizing $\hat{f}$ over the unit hypercube $[0,1]^{\mathcal{V}}$.

Maximizing a submodular function, on the other hand, is NP-hard in general, since it generalizes notoriously hard problems like Max Cut[13]. This asymmetry, where minimization is tractable but maximization is not, is one of the deepest structural facts about submodularity.

The Maximization Problem

The central problem studied in this series is constrained submodular maximization: given a monotone submodular function $f$ and a budget of $k$ items, find the subset that maximizes $f$:

$$\max_{S \subseteq \mathcal{V}} f(S) \quad \text{subject to} \quad |S| \leq k$$

The cardinality constraint $|S| \leq k$ is what makes this interesting. Without it, the answer for a monotone function is trivially $S = \mathcal{V}$: just take everything. The constraint forces a selection, and it is the tension between the budget and the diminishing returns structure that gives the problem its depth.
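To make that tension concrete, here is a brute-force solver; `bruteForceMax` is my own helper, and it only makes sense for tiny ground sets because it enumerates every subset:

```typescript
// Repeated here so the snippet stands alone; identical to set-function.ts above.
type SetFunction<E> = (s: ReadonlySet<E>) => number

// Try every subset of size ≤ k and keep the best. Exponential in |ground|:
// fine for four candidates, hopeless beyond a few dozen elements.
function bruteForceMax<E>(
  f: SetFunction<E>,
  ground: E[],
  k: number,
): { set: Set<E>; value: number } {
  let best = { set: new Set<E>(), value: f(new Set<E>()) }
  for (let mask = 0; mask < 1 << ground.length; mask++) {
    const s = new Set<E>()
    ground.forEach((e, i) => {
      if (mask & (1 << i)) s.add(e)
    })
    if (s.size > k) continue
    const v = f(s)
    if (v > best.value) best = { set: s, value: v }
  }
  return best
}
```

On languageCoverage with $k = 3$, this finds a team covering 5 languages (for example {Alice, Bob, Carol}); with $k = 2$, the best achievable is 4.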

This problem is NP-hard. It generalizes Max-$k$-Coverage[14], and the naive approach of enumerating all $\binom{n}{k}$ subsets of size $k$ is out of the question for any realistic $n$ and $k$. With $n = 1{,}000$ and $k = 50$, you would need to evaluate roughly $10^{85}$ subsets, more than the estimated number of atoms in the observable universe.
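The size of that search space is easy to confirm numerically; summing logarithms avoids overflowing floating point with the astronomically large intermediate factors:

```typescript
// log10 of the binomial coefficient C(n, k),
// computed via C(n, k) = Π_{i=1..k} (n − k + i) / i.
function log10Binomial(n: number, k: number): number {
  let logC = 0
  for (let i = 1; i <= k; i++) {
    logC += Math.log10(n - k + i) - Math.log10(i)
  }
  return logC
}

log10Binomial(1000, 50) // ≈ 85, i.e. about 10^85 subsets of size exactly 50
```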

The real punchline, and the reason this topic is so appealing, is that even though exact maximization is NP-hard, there exist simple polynomial-time algorithms that provably return solutions within a known factor of the true optimum. That factor turns out to be $(1 - 1/e) \approx 0.632$[15], and it comes from the simplest possible strategy: a greedy algorithm. I will explain exactly how and why in Part 2.

How I Became Interested in Submodularity

I first encountered submodularity in December 2020, when I virtually met Dr. Vyacheslav Kungurtsev at an internship proposal event organized by the Department of Mathematics at the University of Padova. He introduced me to Prof. Jakub Mareček at the Czech Technical University in Prague, and together we started a research collaboration that lasted from March to November 2021: weekly virtual meetings, code reviews, and thousands of emails.

They introduced me to the concept of submodular optimization and pushed me to read hundreds of papers on the topic. The first months were rough. We ran into several dead ends, and some of the approaches we tried were both slower and worse than existing methods. My thesis only contains the approaches that actually worked.

What kept me engaged was the elegance of the core results. Most NP-hard problems offer either impractical exact algorithms or heuristics with no quality guarantees. You run the heuristic, get a number, and hope for the best. Submodular maximization is different: a simple greedy algorithm, the kind you would write as a first attempt, comes with a mathematical certificate that its output is at least $63.2\%$ of the true optimum. Always. On every instance. That is rare in combinatorial optimization, and I found it genuinely surprising.

The collaboration eventually led to a paper[16], “Randomized Algorithms for Monotone Submodular Function Maximization on the Integer Lattice”, where we extended the stochastic greedy technique from the set domain to the integer lattice $\mathbb{Z}_{+}^{\mathcal{V}}$, proving that our algorithms are faster than the previous approaches on realistic instances while retaining the same $(1 - 1/e - \varepsilon)$ approximation guarantees. I will tease some of those results in Part 3.

Key Takeaways

  • Submodularity captures the pattern of diminishing returns: the more you already have, the less each new item adds.
  • A set function $f : 2^{\mathcal{V}} \rightarrow \mathbb{R}$ is submodular when the marginal gain $f(e \mid S)$ of any element $e$ can only decrease (or stay the same) as the context set $S$ grows.
  • The concept appears in sensor placement, document summarization, influence maximization, feature selection, facility location, budget allocation, and many other combinatorial selection problems.
  • Submodularity is to discrete optimization what concavity is to continuous optimization, but the relationship with convexity and concavity is richer than the analogy suggests.
  • The constrained maximization problem, $\max f(S)$ subject to $|S| \leq k$, is NP-hard, yet efficient algorithms with provable approximation guarantees exist. Part 2 of this series covers the greedy algorithm and its $(1 - 1/e)$ guarantee.

Further Reading