Stochastic Greedy: Scaling Submodular Maximization to Massive Datasets
In Part 2, I showed that the greedy algorithm achieves a $(1 - 1/e) \approx 0.632$ approximation ratio for maximizing a monotone submodular function $f$ under a cardinality constraint $|S| \le k$, the best any polynomial-time algorithm can guarantee. The algorithm is simple: at each of $k$ steps, scan all $n$ elements, pick the one with the largest marginal gain, and add it to the solution.
The cost is $O(nk)$ evaluations of $f$. For modest problem sizes, this is perfectly adequate. The trouble starts when $n$ and $k$ are both large.
The Scalability Problem
Consider a concrete scenario: you are selecting $k = 10{,}000$ representative images from a corpus of $n = 1{,}000{,}000$ for a dataset summarization task. Each evaluation of $f$, which might involve computing pairwise similarities or coverage scores, takes around 1ms. The greedy algorithm needs $nk = 10^{10}$ evaluations. At 1ms each, that is $10^7$ seconds, or roughly 115 days.
Lazy Greedy (Part 2) can help in practice by skipping redundant recomputations, sometimes by a factor of 5x to 100x. But the worst-case bound remains $O(nk)$, and on adversarial or poorly structured instances the priority queue offers no benefit. We need an algorithm whose theoretical runtime is better, not just one that is faster on benign inputs.
The question is whether we can replace the linear dependence on $k$ with something smaller, while preserving (or nearly preserving) the $1 - 1/e$ guarantee.
The Key Idea: Random Subsampling
The answer turns out to be straightforward. Instead of scanning all remaining elements at each step to find the best marginal gain, sample a small random subset $R \subseteq V \setminus S$ of size $s$ and pick the best element from $R$.
The intuition is natural. The optimal solution $S^*$ contains $k$ elements scattered across the ground set $V$ of size $n$. If you draw a random sample $R$ of size $s$ from the available elements, the probability that $R$ contains at least one element from $S^*$ is roughly $1 - (1 - k/n)^s$. Set $s = \frac{n}{k}\ln\frac{1}{\varepsilon}$ and this probability exceeds $1 - \varepsilon$. The sample does not need to be large; it just needs to be large enough to “hit” at least one good element with high probability.
Below is a ground set of 100 elements. 5 are "good" (the optimal solution S*). Each time you draw a sample, the algorithm picks s random elements and checks whether any overlap with S*. A hit means the sample found at least one good element. The question is: how large does s need to be to hit reliably?
Try this: Set s to a small value (say 5), draw a few samples, and notice how often you miss. Then drag s up to 40–50 and draw again — hits become near-certain. The curve on the right shows why.
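If you prefer to reproduce the experiment without the widget, here is a small Monte Carlo sketch. The function names and defaults are mine, not part of the demo:

```python
import random
from math import comb

def hit_probability(n=100, k=5, s=10, trials=20000, seed=0):
    """Estimate the chance that a uniform sample of s elements
    (drawn without replacement from n) hits at least one of k good ones."""
    rng = random.Random(seed)
    good = set(range(k))  # by symmetry, which k elements are "good" is irrelevant
    hits = sum(
        1 for _ in range(trials)
        if good.intersection(rng.sample(range(n), s))
    )
    return hits / trials

def hit_probability_exact(n=100, k=5, s=10):
    """Exact value: 1 - C(n-k, s) / C(n, s)."""
    return 1 - comb(n - k, s) / comb(n, s)
```

With $s = 5$ the exact hit probability is only about 23%; with $s = 50$ it exceeds 97%, which is why hits become near-certain as you drag $s$ up.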
If you have worked with stochastic gradient descent, the pattern is familiar. SGD replaces the full gradient computation (over all data points) with a noisy estimate computed on a mini-batch. The estimate is noisy on any single step, but the cumulative effect over many steps converges to the right answer, with controllable variance. Stochastic Greedy operates on the same principle: each step is noisier than full greedy, but the overall solution quality degrades only by a small, tunable factor $\varepsilon$.
The Stochastic Greedy Algorithm
Mirzasoleiman et al. (2015) introduced this algorithm under the evocative title “Lazier Than Lazy Greedy.” The entire modification to the standard greedy algorithm fits in a single line: the for-each over $V \setminus S$ becomes a for-each over a random sample $R \subseteq V \setminus S$ of size $s$:
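As a concrete sketch in Python (my own rendering, not the authors' reference code), assuming `f` is an oracle that maps a set of elements to a score:

```python
import math
import random

def stochastic_greedy(f, ground_set, k, eps=0.1, seed=None):
    """Stochastic Greedy (Mirzasoleiman et al. 2015), illustrative sketch.

    f: callable mapping a set to a float (monotone submodular).
    k: cardinality budget.  eps: accuracy parameter.
    """
    rng = random.Random(seed)
    V = list(ground_set)
    n = len(V)
    # sample size from the analysis: s = (n/k) * ln(1/eps), capped at n
    s = min(n, math.ceil((n / k) * math.log(1 / eps)))
    S = set()
    for _ in range(k):
        candidates = [e for e in V if e not in S]
        if not candidates:
            break
        R = rng.sample(candidates, min(s, len(candidates)))
        base = f(S)  # one extra evaluation per step; cache this in production
        # the single changed line: best element of the sample R, not of V \ S
        best = max(R, key=lambda e: f(S | {e}) - base)
        S.add(best)
    return S
```

Everything else is identical to standard greedy; only the candidate pool per step has changed.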
The sample size $s = \frac{n}{k}\log\frac{1}{\varepsilon}$ is the only new parameter. It controls the tradeoff between speed and approximation quality:
- Smaller $\varepsilon$ → larger $s$ → closer to full greedy → better guarantee, slower.
- Larger $\varepsilon$ → smaller $s$ → more aggressive subsampling → faster, weaker guarantee.
The sample is re-drawn independently at every step, which means Stochastic Greedy covers the entire ground set in expectation over the $k$ iterations.
Runtime Analysis
Evaluations per step: $s = \frac{n}{k}\log\frac{1}{\varepsilon}$ instead of $n$.

Total evaluations: $k \cdot s = n\log\frac{1}{\varepsilon}$.

The factor of $k$ in the greedy runtime $O(nk)$ has been replaced by $\log\frac{1}{\varepsilon}$. For any fixed $\varepsilon$, this is a constant, so the total cost is essentially linear in $n$.
A concrete comparison makes the difference tangible. Take $n = 10^6$, $k = 1{,}000$, and $\varepsilon = 0.01$:

| Algorithm | Evaluations | $\varepsilon$ | Wall-clock at 1ms/eval |
|---|---|---|---|
| Greedy | $nk = 10^9$ | - | 11.6 days |
| Stochastic Greedy | $n\ln\frac{1}{\varepsilon} \approx 4.6 \times 10^6$ | $0.01$ | 77 minutes |
That is a ~217x speedup. At 1ms per evaluation, the wall-clock time drops from 11.6 days to 77 minutes.
The speedup factor is $\frac{nk}{n\log(1/\varepsilon)} = \frac{k}{\log(1/\varepsilon)}$. As $k$ grows, the advantage of Stochastic Greedy becomes more pronounced. For $k = 10{,}000$ and $\varepsilon = 0.01$, the speedup is roughly 2,174x.
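The arithmetic is easy to check with a throwaway calculation (not from the paper; rounding conventions differ slightly):

```python
import math

def speedup(k, eps):
    """Stochastic Greedy speedup over full Greedy: nk / (n ln(1/eps)) = k / ln(1/eps)."""
    return k / math.log(1 / eps)

print(round(speedup(1_000, 0.01)))   # -> 217
print(round(speedup(10_000, 0.01)))  # -> 2171
```

Note that $n$ cancels: the speedup depends only on $k$ and $\varepsilon$.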
The Approximation Guarantee
Theorem (Mirzasoleiman et al. 2015). Let $f$ be a monotone submodular function, $k$ a positive integer, and $\varepsilon > 0$. The Stochastic Greedy algorithm with sample size $s = \frac{n}{k}\log\frac{1}{\varepsilon}$ returns a set $S$ of size $k$ satisfying:

$$\mathbb{E}[f(S)] \;\ge\; \left(1 - \frac{1}{e} - \varepsilon\right) f(S^*).$$
The guarantee is $1 - 1/e - \varepsilon$ in expectation, compared to the deterministic $1 - 1/e$ of standard greedy. The additive loss of $\varepsilon$ is the price we pay for subsampling; it can be made arbitrarily small by increasing $s$.
Why Does It Still Work? Proof Intuition
The proof follows the same three-step structure as the greedy proof from Part 2, with a probabilistic layer on top.
Step 1: Bounding a single step. Recall that in the standard greedy proof, we showed that the element picked by greedy satisfies $f(S_t \cup \{e\}) - f(S_t) \ge \frac{\delta_t}{k}$, where $\delta_t = f(S^*) - f(S_t)$ is the gap at step $t$. The argument relied on the fact that the elements of $S^* \setminus S_t$ collectively account for at least $\delta_t$ in marginal gain, so at least one of them has gain at least $\delta_t / k$, and greedy picks the best over all elements.
With Stochastic Greedy, we pick the best over a random sample $R_t$ of size $s$. The probability that $R_t$ contains none of the “good” elements from $S^* \setminus S_t$ is at most:

$$\left(1 - \frac{k}{n}\right)^{s} \;\le\; e^{-sk/n} \;=\; \varepsilon \quad\text{for } s = \frac{n}{k}\ln\frac{1}{\varepsilon}.$$
With probability at least $1 - \varepsilon$, the sample contains at least one element with marginal gain at least $\delta_t / k$. Stochastic Greedy picks the best element in $R_t$, so it does at least as well.
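Plugging in the running example numbers ($n = 10^6$, $k = 1{,}000$, $\varepsilon = 0.01$) confirms the miss-probability bound numerically:

```python
import math

n, k, eps = 10**6, 1_000, 0.01
s = (n / k) * math.log(1 / eps)   # sample size from the analysis, ~4605
miss = (1 - k / n) ** s           # chance the sample avoids every good element
assert miss <= eps                # (1 - k/n)^s <= e^{-sk/n} = eps
print(s, miss)
```

The bound is nearly tight here: the computed miss probability sits just below $\varepsilon = 0.01$.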
Step 2: Compounding. Just as before, the gap shrinks geometrically at each step. The per-step shrinkage factor is slightly worse, $1 - \frac{1-\varepsilon}{k}$ instead of $1 - \frac{1}{k}$, because we occasionally miss all good elements (with probability at most $\varepsilon$).
Step 3: The limit. After $k$ steps, the expected gap is bounded by:

$$\mathbb{E}[\delta_k] \;\le\; \left(1 - \frac{1-\varepsilon}{k}\right)^{k} \delta_0 \;\le\; e^{-(1-\varepsilon)}\, f(S^*).$$

Rearranging: $\mathbb{E}[f(S_k)] \ge \left(1 - e^{-(1-\varepsilon)}\right) f(S^*) \ge \left(1 - \frac{1}{e} - \varepsilon\right) f(S^*)$.
The structure is identical to the deterministic proof. The only difference is the $\varepsilon$ leakage at each step, which accumulates to an additive $\varepsilon$ in the final bound. The proof is clean because submodularity gives the per-step bound, random sampling gives the “hit probability,” and the rest is the same geometric compounding from Part 2.
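The compounding step can also be checked numerically (illustrative values, same $k$ and $\varepsilon$ as the running example):

```python
import math

k, eps = 1_000, 0.01
gap_fraction = (1 - (1 - eps) / k) ** k   # fraction of the gap surviving k steps
achieved = 1 - gap_fraction               # guaranteed fraction of f(S*)
assert gap_fraction <= math.exp(-(1 - eps))
assert achieved >= 1 - 1 / math.e - eps   # matches the stated theorem bound
print(achieved)
```

The computed guarantee (about 0.628 of $f(S^*)$) is slightly stronger than the clean $1 - 1/e - \varepsilon$ form, which simply rounds $1 - e^{-(1-\varepsilon)}$ down.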
Comparison Table
| Property | Greedy | Lazy Greedy | Stochastic Greedy |
|---|---|---|---|
| Evaluations per step | $n$ | $n$ (worst case) | $\frac{n}{k}\log\frac{1}{\varepsilon}$ |
| Total evaluations | $nk$ | $nk$ (worst case) | $n\log\frac{1}{\varepsilon}$ |
| Approximation ratio | $1 - 1/e$ | $1 - 1/e$ | $1 - 1/e - \varepsilon$ (expected) |
| Deterministic | Yes | Yes | No |
| Practical speedup over Greedy | Baseline | 5x–100x (instance-dependent) | $k/\log\frac{1}{\varepsilon}$ (predictable) |
The tradeoff is explicit: Stochastic Greedy trades a small, tunable approximation loss for a predictable runtime improvement of factor $k/\log\frac{1}{\varepsilon}$. Lazy Greedy can sometimes match or beat this in practice, but offers no worst-case guarantee beyond standard greedy.
Practical Guidance
Small $n$ (under 10,000). Standard Greedy or Lazy Greedy is fine. The overhead of random sampling and the $\varepsilon$ degradation are not worth it when the full scan over $V \setminus S$ is already cheap.
Moderate $n$, small $k$. Lazy Greedy tends to perform well here. When $k$ is small, the priority queue rarely needs many re-evaluations per step, and the deterministic guarantee of $1 - 1/e$ is preferable to the expected $1 - 1/e - \varepsilon$.
Large $n$, moderate-to-large $k$. This is where Stochastic Greedy shines. The total number of evaluations, $n\log\frac{1}{\varepsilon}$, is independent of $k$, making it the only option that scales gracefully when both $n$ and $k$ are large.
Choice of $\varepsilon$. A common conservative choice is $\varepsilon = 0.05$. In practice, $\varepsilon = 0.1$ or even $\varepsilon = 0.2$ works well; the theoretical bound is a worst case, and real-world instances tend to behave better. The corresponding $\log\frac{1}{\varepsilon}$ values are about $2.3$ and $1.6$, respectively, which keeps the sample size small.
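For reference, the per-step sample size these choices imply is easy to compute directly (a small helper; the $n$ and $k$ values below are illustrative):

```python
import math

def sample_size(n, k, eps):
    """Per-step sample size s = (n/k) * ln(1/eps), rounded up."""
    return math.ceil((n / k) * math.log(1 / eps))

print(sample_size(1_000_000, 1_000, 0.1))   # -> 2303  (~2.3 * n/k)
print(sample_size(1_000_000, 1_000, 0.2))   # -> 1610  (~1.6 * n/k)
```

Either choice scans a few thousand elements per step instead of a million.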
Key Takeaways
- Standard Greedy requires $O(nk)$ evaluations of $f$, which becomes prohibitive when both $n$ and $k$ are large.
- Stochastic Greedy replaces the full scan at each step with a random subsample of size $s = \frac{n}{k}\log\frac{1}{\varepsilon}$, reducing the total cost to $n\log\frac{1}{\varepsilon}$.
- The approximation guarantee degrades from $1 - 1/e$ to $1 - 1/e - \varepsilon$ in expectation, where $\varepsilon$ is a tunable parameter. For practical values of $\varepsilon$, the speedup factor is $k/\log\frac{1}{\varepsilon}$, often 100x or more.
- The algorithm is trivial to implement: the only change from standard greedy is replacing the scan over $V \setminus S$ with a scan over a random sample $R$ of size $s$.
Further Reading
- B. Mirzasoleiman, A. Badanidiyuru, A. Karbasi, J. Vondrák, and A. Krause, “Lazier Than Lazy Greedy” (2015), the original Stochastic Greedy paper.
- A. Badanidiyuru and J. Vondrák, “Fast Algorithms for Maximizing Submodular Functions” (2014), the Decreasing Threshold Greedy framework.
- A. Krause and D. Golovin, “Submodular Function Maximization” (2014), a comprehensive survey covering greedy, continuous relaxation, and streaming methods.