
The Greedy Algorithm for Submodular Maximization

Series parts
  1. Part 1: An Introduction to Submodularity
  2. Part 2: The Greedy Algorithm for Submodular Maximization
  3. Part 3: Stochastic Greedy: Scaling Submodular Maximization to Massive Datasets

In Part 1, I defined submodular functions, showed that they capture the pattern of diminishing returns across sensor placement, document summarization, influence maximization, and many other domains, and formalized the constrained maximization problem: given a monotone submodular function $f$ and a budget $k$, find $\max_{S \subseteq \mathcal{V}} f(S)$ subject to $|S| \leq k$. I also showed that this problem is NP-hard: enumerating all $\binom{n}{k}$ candidate subsets is intractable for any realistic problem size.

The question becomes: how do we find a good solution efficiently, given that finding the exact optimum is out of reach? The answer lies in approximation algorithms.

Approximation Algorithms

An $\alpha$-approximation algorithm for a maximization problem is an algorithm that runs in polynomial time and, for every instance of the problem, returns a solution whose value is at least $\alpha \cdot \text{OPT}$, where $\text{OPT}$ is the value of the optimal solution. The factor $\alpha \in (0, 1]$ is called the approximation ratio (or performance guarantee).

For instance, a $\frac{1}{2}$-approximation algorithm for a maximization problem always produces a solution worth at least half of the true optimum. Always. On every instance. This is a worst-case guarantee, not an average-case one.

Three reasons to care about approximation algorithms:

  1. Polynomial-time solutions to NP-hard problems. We cannot solve these problems exactly in polynomial time (unless P = NP), but we can get provably close.
  2. A rigorous metric for comparing heuristics. Instead of “this heuristic works well on my benchmark,” you get “this algorithm is within 63.2% of optimal on every possible input.” The guarantee does not depend on the dataset.
  3. Implicit bounds on the optimum. When an $\alpha$-approximation algorithm returns a solution of value $V$, it also tells you that $\text{OPT} \leq V / \alpha$. This is useful when computing the exact optimum is too expensive; you get a bound for free.

For minimization problems, the convention flips: $\alpha \geq 1$, and the algorithm’s output is at most $\alpha \cdot \text{OPT}$. A $2$-approximation for a minimization problem never exceeds twice the optimal cost.
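The implicit-bound point is worth making concrete. Suppose a $(1 - 1/e)$-approximation algorithm returns a solution of value 63.2 on some instance (a made-up number, purely for illustration); without ever computing the optimum, you can bound it:

```python
import math

alpha = 1 - 1 / math.e   # approximation ratio of the greedy algorithm
value = 63.2             # value returned by the algorithm (hypothetical)

# The guarantee value >= alpha * OPT rearranges to OPT <= value / alpha.
opt_upper = value / alpha
print(f"OPT <= {opt_upper:.1f}")
```

Here the optimum is certified to be at most about 100, so the returned solution is provably within 37% of it.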

The Greedy Algorithm

The greedy algorithm for monotone submodular maximization under a cardinality constraint is exactly the algorithm you would write as a first attempt. Start with an empty set. At each step, scan all remaining elements, pick the one with the largest marginal gain, and add it to the solution. Repeat $k$ times.

$$
\boxed{ \begin{aligned} & \textbf{Greedy}(f, \mathcal{V}, k) \\ & \quad S_0 \leftarrow \varnothing \\ & \quad \textbf{for } j = 1 \textbf{ to } k\textbf{:} \\ & \quad \quad e^* \leftarrow \arg\max_{e \in \mathcal{V} \setminus S_{j-1}} f(e \mid S_{j-1}) \\ & \quad \quad S_j \leftarrow S_{j-1} \cup \{e^*\} \\ & \quad \textbf{return } S_k \end{aligned} }
$$

That is the entire algorithm. No relaxation step, no rounding, no linear program. Just a loop with a max operation.

Runtime. At each of the $k$ iterations, the algorithm evaluates $f(e \mid S_{j-1})$ for every element $e \in \mathcal{V} \setminus S_{j-1}$. Each marginal gain computation requires one call to the value oracle for $f$. The total cost is $\mathcal{O}(n \cdot k)$ oracle queries.
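The pseudocode translates almost line for line into Python. A minimal sketch, assuming the value oracle `f` is a callable that maps a frozenset of elements to a number:

```python
def greedy(f, V, k):
    """Greedy maximization of a monotone submodular function f under |S| <= k.

    f is a value oracle: it maps a frozenset of elements to a float.
    V is the ground set; k is the cardinality budget.
    """
    S = frozenset()
    for _ in range(k):
        # Evaluate the marginal gain f(e | S) of every remaining element
        # and add the best one -- O(n) oracle calls per iteration.
        best = max((e for e in V if e not in S),
                   key=lambda e: f(S | {e}) - f(S))
        S |= {best}
    return S
```

For example, with a toy coverage oracle (`cover` maps each element to the set it covers), `greedy` picks the element with the largest union contribution at each step.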

A Worked Example: Sensor Coverage

Consider 6 candidate sensor locations on a grid, labeled $\{1, 2, 3, 4, 5, 6\}$, and a budget of $k = 3$. Each sensor covers a circular region, and the submodular function $f(S)$ measures the total area covered by the union of all sensors in $S$.

[Interactive demo: Greedy Algorithm Debugger. Step through the greedy algorithm on the six candidate sensor locations with a budget of k = 3 to see which sensors it picks and why, or toggle "Try it yourself" to make your own choices and compare. Solid circles mark selected sensors; dashed arcs show the new area each remaining sensor would add.]

Step 1. The algorithm computes the marginal gain $f(\{e\}) - f(\varnothing) = f(\{e\})$ for each candidate $e \in \{1, \dots, 6\}$, since the current set is empty. Suppose sensor 3 covers the largest area individually. We set $S_1 = \{3\}$.

Step 2. The algorithm recomputes the marginal gain of each remaining element with respect to the current solution $S_1 = \{3\}$. Sensor 5, which covers a region that barely overlaps with sensor 3, has the highest additional area. We set $S_2 = \{3, 5\}$.

Step 3. Same logic. Sensor 1 adds the most new area given that sensors 3 and 5 are already placed. We set $S_3 = \{3, 5, 1\}$.

The crucial observation: at each step, the marginal gain of the selected sensor decreases. The first sensor might cover 120 m² on its own; the second adds 95 m² of new area; the third adds only 60 m². This is submodularity at work: the same sensor would have contributed more if placed earlier, when less ground was already covered.

For a ground set this small, you can enumerate all $\binom{6}{3} = 20$ possible subsets and verify that the greedy solution is close to (or exactly equal to) the true optimum. On realistic instances with thousands of candidate locations, enumeration is impossible, but the greedy solution is still guaranteed to be within a precise factor of the optimum.
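The brute-force check is a few lines of code. The sketch below swaps the circular regions for hypothetical sets of covered grid cells (the `coverage` data is invented for illustration), but the greedy-versus-enumeration comparison is the same:

```python
from itertools import combinations

# Hypothetical coverage sets: each sensor covers a set of grid cells.
coverage = {
    1: {0, 1, 2, 3},
    2: {2, 3, 4},
    3: {4, 5, 6, 7, 8},
    4: {7, 8, 9},
    5: {10, 11, 12},
    6: {0, 5, 10},
}

def f(S):
    """Number of distinct cells covered by the sensors in S."""
    return len(set().union(*(coverage[e] for e in S))) if S else 0

# Greedy: repeatedly add the sensor with the largest marginal gain.
S = set()
for _ in range(3):
    best = max(set(coverage) - S, key=lambda e: f(S | {e}) - f(S))
    S.add(best)

# Brute force: enumerate all C(6, 3) = 20 subsets.
opt = max(combinations(coverage, 3), key=f)

print(f(S), f(set(opt)))  # on this instance greedy matches the optimum
```

On this particular instance greedy happens to be exactly optimal; in general it is only guaranteed to be within the $(1 - 1/e)$ factor discussed next.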

The Approximation Guarantee

The following result, due to Nemhauser, Wolsey, and Fisher (1978)¹, is arguably the single most celebrated theorem in submodular optimization.

Theorem. Let $f : 2^{\mathcal{V}} \rightarrow \mathbb{R}_+$ be a monotone submodular function, and let $k$ be a positive integer. The Greedy algorithm returns a set $S_k$ satisfying:

$$
f(S_k) \geq \left(1 - \frac{1}{e}\right) \cdot f(S^*)
$$

where $S^* = \arg\max_{|S| \leq k} f(S)$ is the optimal solution and $e \approx 2.718$ is Euler’s number.

Since $1 - 1/e \approx 0.632$, the greedy algorithm always returns a solution worth at least $63.2\%$ of the true optimum. The guarantee holds for every monotone submodular function and every cardinality constraint $k$. No randomization, no special structure, no tuning.

Two additional facts make this result striking:

  • The bound is tight: there exist instances where the greedy algorithm achieves exactly a $(1 - 1/e)$ fraction of the optimum and no better.
  • The bound is optimal: Feige (1998)² proved that no polynomial-time algorithm can achieve an approximation ratio better than $(1 - 1/e)$ for this problem, unless P = NP.

Greedy is not just “a decent heuristic.” It is the best any efficient algorithm can do.

Greedy vs Random

[Interactive demo: pick a submodular objective, then run the race. Greedy picks the element with the highest marginal gain at each step; Random picks blindly. Both select k elements from a ground set of n. One example objective: hire a team that covers the most distinct skills, where each candidate knows 2–5 skills and overlapping skills don't count twice.]
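The race is easy to reproduce in code. Below is a sketch of the skill-coverage objective with invented candidate data (the instance sizes and the random seed are mine, chosen only for illustration):

```python
import random

random.seed(0)

# Hypothetical instance: each of n candidates knows 2-5 of 20 skills.
n, k, n_skills = 30, 5, 20
skills = {i: set(random.sample(range(n_skills), random.randint(2, 5)))
          for i in range(n)}

def f(S):
    """Number of distinct skills covered by the team S."""
    return len(set().union(*(skills[i] for i in S))) if S else 0

# Greedy: best marginal gain at each step.
S_greedy = set()
for _ in range(k):
    best = max(set(skills) - S_greedy,
               key=lambda i: f(S_greedy | {i}) - f(S_greedy))
    S_greedy.add(best)

# Random: pick k candidates blindly.
S_random = set(random.sample(range(n), k))

print(f(S_greedy), f(S_random))
```

Whatever the draw, the theorem guarantees that the greedy value is at least $(1 - 1/e)$ times the optimum, and hence at least $(1 - 1/e)$ times the random team's value.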

Why Does It Work? A Proof Sketch

The proof has three key steps. I will present them in a way that emphasizes the geometric intuition rather than the full formalism.

Step 1: Each Greedy Step Closes a Fraction of the Gap

Let $S_j$ be the greedy solution after $j$ steps, and let $S^*$ be the optimal solution with $|S^*| \leq k$. Define the gap at step $j$ as:

$$
\delta_j \coloneqq f(S^*) - f(S_j)
$$

The gap measures how far the current greedy solution is from the optimum. At step $j+1$, the greedy algorithm picks the element $e^*$ with the largest marginal gain $f(e^* \mid S_j)$.

Here is the key argument. The optimal solution $S^*$ contains at most $k$ elements. By the monotonicity of $f$, we know that $f(S^* \cup S_j) \geq f(S^*)$, so the total marginal gain of adding all elements of $S^* \setminus S_j$ to $S_j$ is at least $f(S^*) - f(S_j) = \delta_j$. By submodularity, the marginal gains of individual elements of $S^*$ (with respect to $S_j$) sum to at least this gap³:

$$
\sum_{e \in S^* \setminus S_j} f(e \mid S_j) \geq f(S^*) - f(S_j) = \delta_j
$$
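To see why the sum of individual marginal gains bounds the gap, write $S^* \setminus S_j = \{e_1, \dots, e_m\}$ and telescope:

$$
f(S^*) - f(S_j) \;\leq\; f(S^* \cup S_j) - f(S_j) \;=\; \sum_{i=1}^{m} f(e_i \mid S_j \cup \{e_1, \dots, e_{i-1}\}) \;\leq\; \sum_{i=1}^{m} f(e_i \mid S_j)
$$

The first inequality is monotonicity, the middle equality is a telescoping sum, and the last inequality is submodularity: each conditioning set $S_j \cup \{e_1, \dots, e_{i-1}\}$ contains $S_j$, so shrinking it back to $S_j$ can only increase each marginal gain.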

Since $|S^*| \leq k$, at least one element in $S^*$ has marginal gain at least $\delta_j / k$. The greedy algorithm picks the best element overall, so:

$$
f(e^* \mid S_j) \geq \frac{\delta_j}{k}
$$

In words: each greedy step closes at least a $1/k$ fraction of the remaining gap.

Step 2: The Gap Shrinks Geometrically

The inequality above means that after step $j+1$:

$$
\delta_{j+1} = f(S^*) - f(S_{j+1}) \leq \delta_j - \frac{\delta_j}{k} = \delta_j \cdot \left(1 - \frac{1}{k}\right)
$$

Applying this recursively from the initial gap $\delta_0 = f(S^*) - f(\varnothing) = f(S^*)$ (assuming $f(\varnothing) = 0$, which is standard for these problems) gives:

$$
\delta_k \leq f(S^*) \cdot \left(1 - \frac{1}{k}\right)^k
$$

Step 3: Bounding the Geometric Compound

The expression $(1 - 1/k)^k$ is a well-known sequence from calculus. It is monotonically increasing and converges to $1/e \approx 0.368$ as $k \to \infty$⁴.

| $k$ | $(1 - 1/k)^k$ |
| --- | --- |
| 1 | 0.000 |
| 2 | 0.250 |
| 5 | 0.328 |
| 10 | 0.349 |
| 50 | 0.364 |
| 100 | 0.366 |
| $\to \infty$ | $1/e \approx 0.368$ |
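The table's values take only a couple of lines to check:

```python
# Reproduce the convergence table for (1 - 1/k)^k.
for k in (1, 2, 5, 10, 50, 100):
    print(k, round((1 - 1 / k) ** k, 3))
```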
[Figure: Convergence of (1 − 1/k)^k. The sequence rises steeply for small k and levels off well before k = 10; the dashed line marks the limit 1/e ≈ 0.368.]

Since $(1 - 1/k)^k \leq 1/e$ for all $k \geq 1$, we get:

$$
\delta_k \leq \frac{f(S^*)}{e}
$$

Rearranging:

$$
f(S_k) = f(S^*) - \delta_k \geq f(S^*) - \frac{f(S^*)}{e} = \left(1 - \frac{1}{e}\right) \cdot f(S^*)
$$

That is the entire proof structure. Each step closes at least a $1/k$ fraction of the gap; the gap compounds geometrically; and the geometric compound $(1 - 1/k)^k$ is bounded above by $1/e$. The result is clean because submodularity gives us the per-step bound, and elementary calculus handles the rest.

[Interactive demo: Greedy Approximation Staircase. Each step closes at least a 1/k fraction of the remaining gap: bars show f(S_j) rising toward f(S*), and the region above is the gap δ_j, which shrinks geometrically as you step through.]

Lazy Greedy: Exploiting Diminishing Returns for Speed

The standard greedy algorithm recomputes the marginal gain $f(e \mid S_{j-1})$ for every remaining element $e$ at every step $j$. This is wasteful: submodularity guarantees that marginal gains can only decrease over time. If an element had a low marginal gain at step $j-1$, its gain at step $j$ can only be as low or lower. Why recompute it?

Minoux (1978)⁵ formalized this observation into Lazy Greedy. The idea is to maintain a max-heap (priority queue) of upper bounds on marginal gains:

$$
\boxed{ \begin{aligned} & \textbf{LazyGreedy}(f, \mathcal{V}, k) \\ & \quad S_0 \leftarrow \varnothing \\ & \quad \text{Initialize max-heap } H \text{ with } \rho_e \leftarrow +\infty \text{ for all } e \in \mathcal{V} \\ & \quad \textbf{for } j = 1 \textbf{ to } k\textbf{:} \\ & \quad \quad \textbf{loop:} \\ & \quad \quad \quad e \leftarrow H.\text{pop}() \\ & \quad \quad \quad \rho_e \leftarrow f(e \mid S_{j-1}) \quad \text{// recompute actual gain} \\ & \quad \quad \quad \textbf{if } \rho_e \geq H.\text{peek}()\textbf{:} \\ & \quad \quad \quad \quad S_j \leftarrow S_{j-1} \cup \{e\} \\ & \quad \quad \quad \quad \textbf{break} \\ & \quad \quad \quad \textbf{else:} \\ & \quad \quad \quad \quad H.\text{push}(e, \rho_e) \\ & \quad \textbf{return } S_k \end{aligned} }
$$

At each step, Lazy Greedy pops the element $e$ with the highest upper bound from the heap and recomputes its actual marginal gain. If the recomputed gain is still the largest (i.e., $\rho_e \geq$ the next-highest upper bound in the heap), then by submodularity $e$ must have the largest true marginal gain among all remaining elements. The algorithm selects $e$ and moves on. Otherwise, it pushes $e$ back into the heap with its updated bound and tries the next candidate.

The worst-case runtime is still $\mathcal{O}(n \cdot k)$; you might end up recomputing every element at every step. In practice, however, Lazy Greedy often needs only a handful of recomputations per step, because elements whose gains were already low at the previous step are even lower now and stay deep in the heap. The speedup varies depending on the problem instance, but factors of 5x to 100x over standard greedy are common. The approximation guarantee remains $(1 - 1/e)$, unchanged.
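A compact Python sketch of the same idea, using the standard library's `heapq` (a min-heap, so gains are stored negated; function and variable names here are mine, not part of the pseudocode above):

```python
import heapq

def lazy_greedy(f, V, k):
    """Lazy greedy with stale upper bounds on marginal gains.

    f is a value oracle on frozensets. Submodularity guarantees a stale
    bound only overestimates the true marginal gain, so the check below
    is sound.
    """
    S = frozenset()
    # Every element starts with an infinite upper bound (stored as -inf
    # because heapq is a min-heap).
    heap = [(float("-inf"), e) for e in V]
    heapq.heapify(heap)
    for _ in range(k):
        while True:
            _, e = heapq.heappop(heap)
            gain = f(S | {e}) - f(S)  # recompute the actual gain
            # If e still beats the best remaining upper bound, it has the
            # largest true gain among all remaining elements: select it.
            if not heap or gain >= -heap[0][0]:
                S |= {e}
                break
            # Otherwise re-insert e with its tightened bound and retry.
            heapq.heappush(heap, (-gain, e))
    return S
```

Since `heapq` has no decrease-key operation, stale entries are simply popped, re-evaluated, and pushed back with their updated priority.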

If you have worked with lazy evaluation in functional programming or deferred computation in systems engineering, the pattern is familiar: avoid doing work until you know it matters.

When Greedy Fails

The $(1 - 1/e)$ guarantee depends critically on monotonicity. When $f$ is non-monotone (i.e., adding elements can decrease $f$), the greedy algorithm can perform arbitrarily badly.

The thesis chapter I drew this material from contains a concrete counterexample. Consider a ground set $\mathcal{V} = \{1, 2, \dots, n\}$ and define the non-monotone submodular function:

$$
f(S) = \begin{cases} |S| & \text{if } n \notin S \\ 2 & \text{otherwise} \end{cases}
$$

Let $2 \leq k < n$. At the first step, the greedy algorithm adds element $n$, because its marginal gain is $f(\{n\}) - f(\varnothing) = 2$, which is the largest among all singletons (every other element has a gain of $1$). Once $n$ is in the solution, it stays there; the algorithm only adds elements, never removes them. The consequence: no matter what the algorithm picks in the remaining $k-1$ steps, the function value is stuck at $2$, because the presence of $n$ forces $f(S) = 2$ for any $S$ containing $n$.

The optimal solution is to take any $k$ elements other than $n$, yielding $f(S^*) = k$. The greedy solution achieves $f(S_k) = 2$. The approximation ratio is $2/k$, which goes to $0$ as $k$ grows. The algorithm gets trapped by a locally attractive choice that is globally catastrophic.
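The counterexample is small enough to run. A sketch with $n = 10$ and $k = 4$ (concrete values chosen for illustration):

```python
# The trap function: |S| if element n is absent, 2 if it is present.
n, k = 10, 4

def f(S):
    """Non-monotone submodular counterexample function."""
    return 2 if n in S else len(S)

V = list(range(1, n + 1))

# Standard greedy: grabs n first (gain 2 beats gain 1) and gets stuck.
S = set()
for _ in range(k):
    best = max((e for e in V if e not in S), key=lambda e: f(S | {e}) - f(S))
    S.add(best)

print(f(S))                     # 2: the value never recovers
print(f(set(range(1, k + 1))))  # 4: any k elements avoiding n achieve k
```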

The lesson: for non-monotone submodular functions, standard greedy is the wrong tool. Randomized variants like the Random Greedy algorithm (Buchbinder et al. 2014)⁶ recover a constant approximation ratio of $1/e \approx 0.368$ by introducing randomness into the selection step, at the cost of a weaker guarantee.

The type of constraint also matters. With a cardinality constraint, the greedy algorithm achieves $(1 - 1/e)$. With a single matroid constraint (a generalization of cardinality constraints), the $(1 - 1/e)$ ratio is still achievable, but not by plain greedy, which only guarantees $1/2$ there; the algorithms and analysis are considerably more involved⁷. With knapsack constraints, where each element has a cost and the total budget is limited, different algorithms are needed, and the landscape of results becomes considerably more complex.

Looking Ahead

The greedy algorithm requires $\mathcal{O}(n \cdot k)$ function evaluations. Lazy Greedy speeds things up in practice but offers the same worst-case bound. For moderate problem sizes, this is fine. For massive datasets ($n$ in the millions, $k$ in the thousands), even $\mathcal{O}(n \cdot k)$ becomes prohibitive.

In Part 3, I will introduce the Stochastic Greedy algorithm (Mirzasoleiman et al. 2015)⁸, which replaces the full scan over $\mathcal{V}$ at each step with a random subsample of size $\mathcal{O}((n/k) \log(1/\varepsilon))$. The total cost drops from $\mathcal{O}(n \cdot k)$ to $\mathcal{O}(n \cdot \log(1/\varepsilon))$; the factor of $k$ becomes a logarithmic term. The approximation guarantee degrades only slightly, to $(1 - 1/e - \varepsilon)$ in expectation, where $\varepsilon > 0$ is a tunable parameter.

Key Takeaways

  • The greedy algorithm for monotone submodular maximization under a cardinality constraint is simple: at each step, pick the element with the largest marginal gain. Repeat $k$ times.
  • It achieves an approximation ratio of $(1 - 1/e) \approx 0.632$, provably the best any efficient algorithm can achieve for this problem.
  • The proof relies on a geometric argument: each greedy step closes at least a $1/k$ fraction of the remaining gap to the optimum, and after $k$ steps the residual gap is at most $f(S^*)/e$.
  • Lazy Greedy exploits the diminishing returns property to skip redundant computations, often yielding 5x-100x speedups in practice with the same guarantee.
  • The greedy algorithm requires $\mathcal{O}(n \cdot k)$ function evaluations. For massive datasets, even this linear dependence on $k$ is too expensive, motivating the Stochastic Greedy algorithm in Part 3.
  • Monotonicity is critical: without it, greedy can achieve an approximation ratio as bad as $2/k$, which is effectively useless for large $k$.
