
An Introduction to Submodularity

· 15 min read
Series parts
  1. Part 1: An Introduction to Submodularity
  2. Part 2: The Greedy Algorithm for Submodular Maximization
  3. Part 3: Stochastic Greedy: Scaling Submodular Maximization to Massive Datasets

Imagine you are hiring engineers for a new team. You have a budget for $k$ hires, and each candidate brings a mix of skills: languages, frameworks, domain expertise. The first person you hire is enormously valuable: they fill every gap, because the team has no one yet. The second hire is still great, but some of their skills overlap with person number one. By the time you are picking the tenth engineer, the team already covers most of what you need, and each additional hire moves the needle less.

This pattern, where each new addition helps but helps less than the previous one, is called diminishing returns. It shows up constantly in engineering and optimization: adding servers to a load-balanced cluster, selecting test cases for a test suite, picking features for a machine learning model, choosing seed users for a viral marketing campaign. The pattern is so pervasive that mathematicians gave it a name and built an entire theory around it.

That name is submodularity.

Set Functions: The Building Block

Before I can define submodularity precisely, I need one prerequisite: the notion of a set function.

A set function is a function whose input is a set rather than a number. You start with a finite collection of items called the ground set, written as $\mathcal{V} = \{1, 2, \dots, n\}$. These items can be anything: job candidates, sensor locations, advertising channels, data points. A set function $f$ takes any subset $S \subseteq \mathcal{V}$ and maps it to a real number. Formally, $f : 2^{\mathcal{V}} \rightarrow \mathbb{R}$.[1]

Two concrete examples:

  • $f(S)$ = the number of distinct programming languages known by the team members in $S$.
  • $f(S)$ = the total geographic area covered by sensors placed at the locations in $S$.

Both of these functions take a set as input and produce a single number. That is all a set function is.

Diminishing Returns, Formalized

Back to the hiring example. Suppose you have four candidates:

| Candidate | Languages                |
| --------- | ------------------------ |
| Alice     | Python, Java, TypeScript |
| Bob       | Python, Go               |
| Carol     | Java, Rust               |
| David     | Python, Java, Go         |
Define $f(S)$ as the number of distinct languages covered by the candidates in $S$. The marginal gain of adding a person $e$ to an existing team $S$ measures how much $f$ increases:

$$f(e \mid S) \coloneqq f(S \cup \{e\}) - f(S)$$

Toggle candidates on and off in the interactive demo below to explore how $f(S)$ and marginal gains change as the team grows:

*Interactive demo: Marginal Gain Explorer. The left panel shows who is selected; the right shows how many new skills each remaining candidate would add. Notice how marginal gains shrink as the team grows: that is diminishing returns in action.*

Consider David’s marginal gain across three scenarios:

  • Adding David to $\varnothing$: the team goes from nothing to {Python, Java, Go}. Marginal gain $= 3$.
  • Adding David to $\{\text{Alice}\}$: the team already covers {Python, Java, TypeScript}. David adds Go, but Python and Java overlap. Marginal gain $= 1$.
  • Adding David to $\{\text{Alice}, \text{Bob}\}$: the team already covers {Python, Java, TypeScript, Go}. Every language David knows is redundant. Marginal gain $= 0$.

David’s value drops from $3$ to $1$ to $0$ as the team grows, even though his skills have not changed. The context grew around him, making his contribution increasingly redundant. Now consider Bob, who is the only person who knows Go:

  • Adding Bob to $\varnothing$: marginal gain $= 2$.
  • Adding Bob to $\{\text{Alice}\}$: marginal gain $= 1$ (Go is new; Python overlaps).
  • Adding Bob to $\{\text{Alice}, \text{Carol}, \text{David}\}$: marginal gain $= 0$ (David already brought Go; Python was covered by Alice).

Bob’s marginal gain also decreases, but it drops to zero only once David is on the team, because David is the other person who knows Go.

The pattern is consistent: marginal gains can only stay the same or decrease as the context set grows. This is the diminishing returns property in action.

A function $f$ is submodular when this pattern holds systematically: for any element $e \in \mathcal{V} \setminus B$ and any two sets $A \subseteq B \subseteq \mathcal{V}$, adding $e$ to the smaller set $A$ gives at least as much gain as adding $e$ to the larger set $B$:

$$f(A \cup \{e\}) - f(A) \geq f(B \cup \{e\}) - f(B)$$

In plain language: the marginal gain of any element can only stay the same or decrease as the context set grows. Bigger context, smaller (or equal) marginal contribution.

Set Functions in Code

The concepts above translate directly into TypeScript. A set function is any function that takes a set and returns a number:

set-function.ts

```ts
/**
 * A set function maps subsets of a ground set to real numbers.
 */
type SetFunction<E> = (s: ReadonlySet<E>) => number

/**
 * Marginal gain of adding element e to set S:
 * f(e | S) = f(S ∪ {e}) − f(S).
 */
function marginalGain<E>(
  f: SetFunction<E>,
  s: ReadonlySet<E>,
  e: E,
): number {
  const withE = new Set(s)
  withE.add(e)
  return f(withE) - f(s)
}
```

The SetFunction type is generic over the element type E; it works whether your ground set contains strings, numbers, or any other type. The marginalGain function is a direct translation of the formula $f(e \mid S) = f(S \cup \{e\}) - f(S)$.

The language-coverage function from the hiring example is straightforward to implement:

language-coverage.ts

```ts
// Ground set: four candidates, each covering a set of programming languages.
const skills: Record<string, ReadonlySet<string>> = {
  Alice: new Set(["Python", "Java", "TypeScript"]),
  Bob: new Set(["Python", "Go"]),
  Carol: new Set(["Java", "Rust"]),
  David: new Set(["Python", "Java", "Go"]),
}

// f(S) = number of distinct languages covered by the candidates in S.
const languageCoverage: SetFunction<string> = (s) => {
  const covered = new Set<string>()
  for (const candidate of s) {
    const langs = skills[candidate]
    if (langs) for (const l of langs) covered.add(l)
  }
  return covered.size
}

// Try it:
languageCoverage(new Set(["Alice"]))                 // → 3 (Python, Java, TypeScript)
languageCoverage(new Set(["Alice", "Bob"]))          // → 4 (+ Go)
languageCoverage(new Set(["Alice", "Bob", "Carol"])) // → 5 (+ Rust)

// Marginal gain of David given {Alice}:
marginalGain(languageCoverage, new Set(["Alice"]), "David") // → 1 (Go is new)
```

Notice how the function itself is the thing that exhibits diminishing returns; the code does not enforce submodularity. It just evaluates the set function honestly. Whether a particular SetFunction is submodular depends on the problem structure, not on the implementation.
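Because the code does not enforce submodularity, it can be useful to test for it. For small ground sets, the diminishing-returns definition can be checked exhaustively. A sketch under that assumption (the `subsets` and `isSubmodularDR` helpers are mine, not from any library, and the check is exponential in the ground set size, so it is only a sanity tool):

```typescript
// Repeated here so the snippet stands alone; identical to set-function.ts above.
type SetFunction<E> = (s: ReadonlySet<E>) => number

// Enumerate all 2^n subsets of a small ground set via bitmasks.
function subsets<E>(ground: E[]): Set<E>[] {
  const out: Set<E>[] = []
  for (let mask = 0; mask < 1 << ground.length; mask++) {
    const s = new Set<E>()
    ground.forEach((e, i) => {
      if (mask & (1 << i)) s.add(e)
    })
    out.push(s)
  }
  return out
}

// Diminishing returns: f(A ∪ {e}) − f(A) ≥ f(B ∪ {e}) − f(B)
// for every A ⊆ B and every e ∉ B.
function isSubmodularDR<E>(f: SetFunction<E>, ground: E[]): boolean {
  const all = subsets(ground)
  for (const a of all) {
    for (const b of all) {
      if (![...a].every((x) => b.has(x))) continue // require A ⊆ B
      for (const e of ground) {
        if (b.has(e)) continue
        const gainA = f(new Set([...a, e])) - f(a)
        const gainB = f(new Set([...b, e])) - f(b)
        if (gainA < gainB - 1e-9) return false // tolerance for float noise
      }
    }
  }
  return true
}
```

Running this on languageCoverage over the four candidates returns true; running it on a supermodular function such as $f(S) = |S|^2$ returns false, because its marginal gains grow with context.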

This SetFunction / marginalGain pair is all the machinery we need. In Part 2, we will build the greedy algorithm on top of these two primitives and prove that it achieves a $(1 - 1/e)$ approximation guarantee.

Two Equivalent Definitions

The diminishing returns formulation is the most intuitive one, but there is an equivalent definition that you will encounter if you read any paper on the topic:

$$f(A) + f(B) \geq f(A \cup B) + f(A \cap B) \quad \text{for all } A, B \subseteq \mathcal{V}$$

The intuition: when you take the union of $A$ and $B$, the overlapping elements get “counted” less efficiently than if you evaluated $A$ and $B$ separately. The union “wastes” some value because of redundancy; the intersection is lean.

These two definitions, diminishing returns and the union/intersection inequality, are mathematically equivalent for set functions. I will not prove this here (the proof is a short induction argument[2]), but both forms matter in practice. The diminishing returns version is better for building intuition; the union/intersection version is often easier to work with in proofs.
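The union/intersection form is just as mechanical to check on small ground sets. A sketch mirroring the brute-force approach (`isSubmodularUI` is my own helper name, not a library function):

```typescript
// Repeated here so the snippet stands alone; identical to set-function.ts above.
type SetFunction<E> = (s: ReadonlySet<E>) => number

function union<E>(a: ReadonlySet<E>, b: ReadonlySet<E>): Set<E> {
  return new Set([...a, ...b])
}

function intersection<E>(a: ReadonlySet<E>, b: ReadonlySet<E>): Set<E> {
  return new Set([...a].filter((x) => b.has(x)))
}

// Check f(A) + f(B) ≥ f(A ∪ B) + f(A ∩ B) for every pair of subsets A, B.
function isSubmodularUI<E>(f: SetFunction<E>, ground: E[]): boolean {
  const all: Set<E>[] = []
  for (let mask = 0; mask < 1 << ground.length; mask++) {
    const s = new Set<E>()
    ground.forEach((e, i) => {
      if (mask & (1 << i)) s.add(e)
    })
    all.push(s)
  }
  for (const a of all) {
    for (const b of all) {
      if (f(a) + f(b) < f(union(a, b)) + f(intersection(a, b)) - 1e-9) {
        return false
      }
    }
  }
  return true
}
```

Since the two definitions are equivalent, this checker must agree with the diminishing-returns checker on every set function, which is itself a handy cross-check of both implementations.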

One important point: submodularity is a property of the function $f$, not of any specific algorithm or data structure. It describes the shape of how value accumulates when you add elements to a set.

Submodularity Verifier

*Interactive demo: is a given set function submodular? Pick a preset, define a coverage matrix, or enter values directly. The verifier checks every pair of sets against both definitions, diminishing returns and the union/intersection inequality, and shows a witness or counterexample.*

Monotonicity and Modularity

Two related properties come up alongside submodularity often enough that they are worth defining upfront.

Monotone functions. A set function $f$ is monotone if adding elements never hurts: whenever $A \subseteq B$, it holds that $f(A) \leq f(B)$. The language-coverage function from the hiring example is monotone: more people on the team means the same or more languages covered; you never lose coverage by hiring someone. However, not every submodular function is monotone. A classic counterexample is the graph cut function[3], where $f(S)$ counts the edges between the nodes in $S$ and the nodes outside $S$. Selecting all nodes gives a cut of zero, which is worse than a well-chosen subset.

Modular functions. A function is modular if the marginal gain of every element is the same regardless of context: no diminishing returns, no increasing returns. This is the discrete analogue of a linear function:

$$f(S) = \sum_{e \in S} w(e)$$

where each element $e$ has a fixed weight $w(e)$. In the hiring analogy, a modular function would mean every candidate brings skills that nobody else has. In reality, skills overlap, which is exactly why most coverage-type objectives are submodular rather than modular.
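A modular objective is a one-liner in the SetFunction framework. A sketch with hypothetical per-candidate weights (the numbers are made up for illustration):

```typescript
// Repeated here so the snippet stands alone; identical to set-function.ts above.
type SetFunction<E> = (s: ReadonlySet<E>) => number

// Hypothetical fixed weights: each candidate contributes w(e) regardless of
// who else is already on the team.
const weights: Record<string, number> = { Alice: 3, Bob: 2, Carol: 2, David: 3 }

// Modular: f(S) = Σ_{e ∈ S} w(e). The marginal gain of e is always w(e).
const modularValue: SetFunction<string> = (s) => {
  let total = 0
  for (const e of s) total += weights[e] ?? 0
  return total
}

modularValue(new Set(["Alice", "Bob"])) // → 5
```

Compare with languageCoverage, which gives 4 for the same pair: the gap of one is exactly the overlap (Python) that makes coverage submodular rather than modular.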

For completeness: supermodular functions are the mirror image of submodular ones, where marginal gains increase with set size. They model cooperation effects: the value of adding a new member grows when more people are already present. Formally, $f$ is supermodular if and only if $-f$ is submodular. Supermodularity is relevant in game theory and social choice, but this blog series focuses on the submodular side.

Where Submodularity Shows Up

The language-coverage example is deliberately simple. The concept becomes genuinely useful when you realize how many real problems exhibit this structure.

Sensor placement. You want to place $k$ sensors across a city to maximize the total area monitored. The first sensor covers a fresh region. The fifth sensor’s coverage inevitably overlaps with the sensors already deployed. The objective function, total monitored area as a function of the selected sensor locations, is monotone submodular[4].

Document summarization. Given a large corpus of text, you want to select $k$ sentences that best summarize the content. Each additional sentence contributes less novel information as the summary grows. Lin and Bilmes (2011)[5] formalized this as a submodular maximization problem and obtained results that are competitive with much heavier NLP pipelines.

Feature selection in machine learning. When selecting $k$ features from a dataset to train a model, each new feature adds less predictive power if it is correlated with features already chosen. Information-gain-based objectives are submodular[6].

Influence maximization in social networks. Pick $k$ seed users to maximize the spread of a message through a network. The more influencers you already have, the less incremental reach a new seed user provides. Kempe, Kleinberg, and Tardos (2003)[7] showed this is a submodular maximization problem, and the result spawned an entire research direction.

Facility location. Where should you place $k$ warehouses, hospitals, or CDN edge servers to maximize service coverage? Given a matrix $M \in \mathbb{R}^{m \times n}$ where $M_{ij}$ quantifies the service value that facility $j$ provides to customer $i$, the total service value is:

$$f(S) \coloneqq \sum_{i=1}^{m} \max_{j \in S} M_{ij}$$

Each new facility serves the population closest to it, but as you add more, the marginal improvement in total service quality diminishes. This function is monotone submodular when $M_{ij} \geq 0$ for all $i, j$[8].
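Here is a minimal sketch of this objective in the SetFunction framework; the 3×3 matrix is a made-up example, not data from any real deployment:

```typescript
// Repeated here so the snippet stands alone; identical to set-function.ts above.
type SetFunction<E> = (s: ReadonlySet<E>) => number

// M[i][j] ≥ 0: service value facility j provides to customer i (toy numbers).
const M: number[][] = [
  [5, 1, 0],
  [2, 4, 3],
  [0, 2, 6],
]

// f(S) = Σ_i max_{j ∈ S} M[i][j], with the convention f(∅) = 0.
const facilityLocation: SetFunction<number> = (s) => {
  let total = 0
  for (const row of M) {
    let best = 0 // a valid floor because every M[i][j] is nonnegative
    for (const j of s) best = Math.max(best, row[j])
    total += best
  }
  return total
}

facilityLocation(new Set([0]))    // → 7  (every customer served by facility 0)
facilityLocation(new Set([0, 2])) // → 14 (each customer picks the better facility)
```

The `max` is what creates the diminishing returns: a new facility only helps the customers for whom it beats every facility already selected.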

Optimal budget allocation. You have a fixed advertising budget to distribute among $k$ channels. Each dollar spent on a channel influences potential customers, but with diminishing effectiveness as more budget is allocated[9]. This was a motivating problem for my own thesis work, and I will come back to it in Part 3.

The common thread: these are all combinatorial selection problems where the objective exhibits diminishing returns. Economists have studied diminishing returns for centuries[10]. Submodularity is the combinatorial optimization community’s formalization of the same idea.

Submodularity as Discrete Concavity

If you remember concave functions from calculus, there is a clean analogy. A concave function $g(x)$ on the real line has a decreasing derivative: the slope flattens as $x$ grows. A submodular function behaves the same way, but in the discrete world: the “derivative” (marginal gain) decreases as the set grows.

Concavity Bridge

*Interactive demo: the left panel shows a continuous curve $g(x)$; the right shows the discrete marginal gains $\Delta g(i) = g(i) - g(i-1)$. Drag the point along the curve and watch how the tangent slope mirrors the height of the corresponding marginal gain bar. When $f(S) = g(|S|)$ depends only on set size, $f$ is submodular if and only if $g$ is concave.*

This analogy is more than a metaphor. There is a formal result: if you define a set function $f(S) \coloneqq g(|S|)$ that only depends on the cardinality of $S$, then $f$ is submodular if and only if $g$ is concave[11]. The discrete notion and the continuous notion align perfectly in this special case.
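A quick sketch of this special case, using the concave function $g(x) = \sqrt{x}$:

```typescript
// Repeated here so the snippet stands alone; identical to set-function.ts above.
type SetFunction<E> = (s: ReadonlySet<E>) => number

// g is concave on [0, ∞), so f(S) = g(|S|) is submodular.
const g = (x: number): number => Math.sqrt(x)
const bySize: SetFunction<number> = (s) => g(s.size)

// The marginal gain of any element given a set of size k is g(k + 1) − g(k),
// which strictly decreases: 1, ≈0.414, ≈0.318, ≈0.268, …
const gains: number[] = [0, 1, 2, 3].map((k) => g(k + 1) - g(k))
```

Swapping in a convex $g$, say $g(x) = x^2$, flips the inequality and yields a supermodular $f$ instead.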

However, the full picture is more nuanced than “submodularity == concavity.” Submodular functions relate to both convexity and concavity, depending on which problem you are solving. Minimizing a submodular function (without constraints) can be done in polynomial time, analogously to minimizing a convex function. There is even an elegant continuous extension called the Lovász extension[12], $\hat{f} : [0,1]^{\mathcal{V}} \rightarrow \mathbb{R}$, that makes this precise. When $f$ is submodular, $\hat{f}$ is convex, and minimizing $f$ over subsets of $\mathcal{V}$ reduces to minimizing $\hat{f}$ over the unit hypercube $[0,1]^{\mathcal{V}}$.

Maximizing a submodular function, on the other hand, is NP-hard in general, since it generalizes notoriously hard problems like Max Cut[13]. This asymmetry, where minimization is tractable but maximization is not, is one of the deepest structural facts about submodularity.

The Maximization Problem

The central problem studied in this series is constrained submodular maximization: given a monotone submodular function $f$ and a budget of $k$ items, find the subset that maximizes $f$:

$$\max_{S \subseteq \mathcal{V}} f(S) \quad \text{subject to} \quad |S| \leq k$$

The cardinality constraint $|S| \leq k$ is what makes this interesting. Without it, the answer for a monotone function is trivially $S = \mathcal{V}$: just take everything. The constraint forces a selection, and it is the tension between the budget and the diminishing returns structure that gives the problem its depth.
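To make that tension concrete, here is a brute-force solver; `bruteForceMax` is my own helper, and it only makes sense for tiny ground sets because it enumerates every subset:

```typescript
// Repeated here so the snippet stands alone; identical to set-function.ts above.
type SetFunction<E> = (s: ReadonlySet<E>) => number

// Try every subset of size ≤ k and keep the best. Exponential in |ground|:
// fine for four candidates, hopeless beyond a few dozen elements.
function bruteForceMax<E>(
  f: SetFunction<E>,
  ground: E[],
  k: number,
): { set: Set<E>; value: number } {
  let best = { set: new Set<E>(), value: f(new Set<E>()) }
  for (let mask = 0; mask < 1 << ground.length; mask++) {
    const s = new Set<E>()
    ground.forEach((e, i) => {
      if (mask & (1 << i)) s.add(e)
    })
    if (s.size > k) continue
    const v = f(s)
    if (v > best.value) best = { set: s, value: v }
  }
  return best
}
```

On languageCoverage with $k = 3$, this finds a team covering 5 languages (for example {Alice, Bob, Carol}); with $k = 2$, the best achievable is 4.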

This problem is NP-hard. It generalizes Max-$k$-Coverage[14], and the naive approach of enumerating all $\binom{n}{k}$ subsets of size $k$ is out of the question for any realistic $n$ and $k$. With $n = 1{,}000$ and $k = 50$, you would need to evaluate roughly $10^{85}$ subsets, more than the estimated number of atoms in the observable universe.
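The size of that search space is easy to confirm numerically; summing logarithms avoids overflowing floating point with the astronomically large intermediate factors:

```typescript
// log10 of the binomial coefficient C(n, k),
// computed via C(n, k) = Π_{i=1..k} (n − k + i) / i.
function log10Binomial(n: number, k: number): number {
  let logC = 0
  for (let i = 1; i <= k; i++) {
    logC += Math.log10(n - k + i) - Math.log10(i)
  }
  return logC
}

log10Binomial(1000, 50) // ≈ 85, i.e. about 10^85 subsets of size exactly 50
```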

The real punchline, and the reason this topic is so appealing, is that even though exact maximization is NP-hard, there exist simple polynomial-time algorithms that provably return solutions within a known factor of the true optimum. That factor turns out to be $(1 - 1/e) \approx 0.632$[15], and it comes from the simplest possible strategy: a greedy algorithm. I will explain exactly how and why in Part 2.

How I Became Interested in Submodularity

I first encountered submodularity in December 2020, when I virtually met Dr. Vyacheslav Kungurtsev at an internship proposal event organized by the Department of Mathematics at the University of Padova. He introduced me to Prof. Jakub Mareček at the Czech Technical University in Prague, and together we started a research collaboration that lasted from March to November 2021: weekly virtual meetings, code reviews, and thousands of emails.

They introduced me to the concept of submodular optimization and pushed me to read hundreds of papers on the topic. The first months were rough. We ran into several dead ends, and some of the approaches we tried were both slower and worse than existing methods. My thesis only contains the approaches that actually worked.

What kept me engaged was the elegance of the core results. Most NP-hard problems offer either impractical exact algorithms or heuristics with no quality guarantees. You run the heuristic, get a number, and hope for the best. Submodular maximization is different: a simple greedy algorithm, the kind you would write as a first attempt, comes with a mathematical certificate that its output is at least $63.2\%$ of the true optimum. Always. On every instance. That is rare in combinatorial optimization, and I found it genuinely surprising.

The collaboration eventually led to a paper[16], “Randomized Algorithms for Monotone Submodular Function Maximization on the Integer Lattice”, where we extended the stochastic greedy technique from the set domain to the integer lattice $\mathbb{Z}_{+}^{\mathcal{V}}$, proving that our algorithms are faster than the previous approaches on realistic instances while retaining the same $(1 - 1/e - \varepsilon)$ approximation guarantees. I will tease some of those results in Part 3.

Key Takeaways

  • Submodularity captures the pattern of diminishing returns: the more you already have, the less each new item adds.
  • A set function $f : 2^{\mathcal{V}} \rightarrow \mathbb{R}$ is submodular when the marginal gain $f(e \mid S)$ of any element $e$ can only decrease (or stay the same) as the context set $S$ grows.
  • The concept appears in sensor placement, document summarization, influence maximization, feature selection, facility location, budget allocation, and many other combinatorial selection problems.
  • Submodularity is to discrete optimization what concavity is to continuous optimization, but the relationship with convexity and concavity is richer than the analogy suggests.
  • The constrained maximization problem, $\max f(S)$ subject to $|S| \leq k$, is NP-hard, yet efficient algorithms with provable approximation guarantees exist. Part 2 of this series covers the greedy algorithm and its $(1 - 1/e)$ guarantee.

Further Reading