Why Probability Matters for CS
Without probability, you can only talk about worst-case and best-case performance, and those extremes often misrepresent what actually happens. Average-case analysis gives a much more realistic picture. Beyond analysis, randomized algorithms (quicksort with random pivots, randomized primality testing, sketching algorithms) explicitly use probability to get better expected performance or simpler implementations than their deterministic counterparts. Discrete probability, where outcomes are countable, covers the vast majority of what algorithm analysis needs.
Sample Spaces and Events
A sample space $\Omega$ is the set of all possible outcomes of an experiment. An event $A$ is a subset of outcomes. The probability function $P$ assigns a value $P(A) \in [0, 1]$ to each event, with $P(\Omega) = 1$.
Basic rules:
- Complement: $P(A^c) = 1 - P(A)$
- Union: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
- Independence: $A$ and $B$ are independent if $P(A \cap B) = P(A)\,P(B)$
Warning
Independence and mutual exclusivity are not the same thing. Mutually exclusive events (if one happens, the other can’t) are actually dependent unless one of them has probability 0. This trips people up more often than it should.
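A quick way to see the difference is to enumerate a small sample space exactly. A minimal sketch with two fair dice, using rational arithmetic so the probabilities are exact:

```python
# Independence vs. mutual exclusivity, checked on a two-dice sample space.
from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def prob(event):
    return Fraction(sum(1 for o in space if event(o)), len(space))

A = lambda o: o[0] % 2 == 0   # first die even
B = lambda o: o[1] % 2 == 0   # second die even
C = lambda o: sum(o) == 2     # sum is 2
D = lambda o: sum(o) == 3     # sum is 3

both = lambda e, f: (lambda o: e(o) and f(o))

# Independent events satisfy the product rule:
print(prob(both(A, B)) == prob(A) * prob(B))  # True
# Mutually exclusive events do not (unless one has probability 0):
print(prob(both(C, D)), prob(C) * prob(D))    # 0 vs 1/648
```

Here $C$ and $D$ can never co-occur, so $P(C \cap D) = 0$, yet $P(C)P(D) > 0$: mutually exclusive, therefore dependent.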
Conditional Probability
The probability of $A$ given that $B$ has occurred:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$
Conditioning is the mechanism that lets you update beliefs as you learn new information. It’s the foundation of Bayesian reasoning and shows up everywhere from spam filters to medical diagnosis.
Bayes’ Theorem
This is the theorem that makes statistics and machine learning work. It reverses the conditioning direction:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
With a partition $A_1, \dots, A_n$ of $\Omega$, the denominator expands via the law of total probability:

$$P(B) = \sum_{i=1}^{n} P(B \mid A_i)\,P(A_i)$$
Example
Spam filtering. Let $S$ = “email is spam” and $F$ = “email contains the word ‘free’.” Given, say, $P(S) = 0.4$, $P(F \mid S) = 0.5$, and $P(F \mid S^c) = 0.1$:

$$P(S \mid F) = \frac{0.5 \cdot 0.4}{0.5 \cdot 0.4 + 0.1 \cdot 0.6} = \frac{0.20}{0.26} \approx 0.77$$

An email containing “free” has about a 77% chance of being spam. Naive Bayes classifiers extend this across many features and remain competitive for text classification despite their simplifying assumptions.
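The same calculation in code, with illustrative numbers assumed here ($P(S) = 0.4$, $P(F \mid S) = 0.5$, $P(F \mid S^c) = 0.1$, chosen to reproduce the roughly-77% figure):

```python
# Bayes' theorem for the spam example, with illustrative (assumed) numbers.
p_spam = 0.4                 # prior: P(S)
p_free_given_spam = 0.5      # likelihood: P(F | S)
p_free_given_ham = 0.1       # P(F | not S)

# Law of total probability gives the denominator P(F)
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' theorem: P(S | F)
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(f"{p_spam_given_free:.2f}")  # 0.77
```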
Expected Value
A random variable $X$ assigns a number to each outcome. The expected value (mean):

$$E[X] = \sum_{x} x \cdot P(X = x)$$
Two properties that get used constantly:
- Linearity of expectation: $E[X + Y] = E[X] + E[Y]$, always, even when $X$ and $Y$ are dependent. This is the single most useful fact in probabilistic algorithm analysis.
- Variance: $\mathrm{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$, measures spread around the mean.
Tip
Linearity of expectation is shockingly powerful because it requires no independence assumption. You can decompose a complicated random variable into simple indicator variables, compute each expectation trivially, and sum them up. The randomized quicksort analysis and the coupon collector problem both work this way.
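A classic instance of the indicator pattern: the expected number of fixed points of a uniform random permutation is exactly 1 for any $n$, because each of the $n$ indicators has expectation $1/n$ and they sum by linearity, dependence notwithstanding. A quick simulation sketch:

```python
# Linearity of expectation via indicators: the expected number of fixed
# points of a uniform random permutation is exactly 1, regardless of n.
# X_i = 1 if position i is a fixed point; E[X_i] = 1/n, and the n
# indicators sum to 1 even though they are dependent.
import random

def avg_fixed_points(n, trials=100_000):
    total = 0
    for _ in range(trials):
        perm = list(range(n))
        random.shuffle(perm)
        total += sum(1 for i, v in enumerate(perm) if i == v)
    return total / trials

print(avg_fixed_points(10))  # hovers near the exact answer, 1.0
```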
Common Distributions
| Distribution | Setup | $E[X]$ |
|---|---|---|
| Bernoulli | Single trial, success probability $p$ | $p$ |
| Binomial | $n$ independent trials, each with success probability $p$ | $np$ |
| Geometric | Trials until first success, each with probability $p$ | $1/p$ |
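A rough empirical check of these expectations, using simple samplers built from `random.random()` (the parameter values $p = 0.3$ and $n = 20$ are arbitrary choices for illustration):

```python
# Empirical means of Bernoulli, Binomial, and Geometric samplers,
# compared against the closed-form expectations p, n*p, and 1/p.
import random

p, n, trials = 0.3, 20, 200_000

bernoulli = lambda: 1 if random.random() < p else 0
binomial = lambda: sum(bernoulli() for _ in range(n))

def geometric():
    # number of trials up to and including the first success
    count = 1
    while random.random() >= p:
        count += 1
    return count

mean = lambda sampler: sum(sampler() for _ in range(trials)) / trials
print(mean(bernoulli))  # close to p = 0.3
print(mean(binomial))   # close to n*p = 6.0
print(mean(geometric))  # close to 1/p = 3.33...
```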
Concentration Inequalities
Expected value tells you the average, but you often need to know how tightly a random variable clusters around that average.
- Markov’s inequality: for non-negative $X$ and $a > 0$, $P(X \geq a) \leq \frac{E[X]}{a}$. Crude but universally applicable.
- Chebyshev’s inequality: $P(|X - E[X]| \geq t) \leq \frac{\mathrm{Var}(X)}{t^2}$. Uses variance for a tighter bound.
- Chernoff bounds: exponentially tight bounds for sums of independent random variables. This is the workhorse for analyzing randomized algorithms and proving “with high probability” guarantees.
Note
The progression Markov → Chebyshev → Chernoff represents increasing tightness at the cost of stronger assumptions. Markov needs non-negativity. Chebyshev needs finite variance. Chernoff needs independence. Knowing when to reach for which one is a recurring theme in algorithm analysis.
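To see the progression concretely, here is a sketch comparing both bounds against an exact binomial tail (the choice of $\mathrm{Binomial}(100, 0.5)$ and threshold 75 is illustrative):

```python
# Markov and Chebyshev bounds vs. the exact tail P(X >= 75) for
# X ~ Binomial(100, 0.5), where E[X] = 50 and Var(X) = 25.
from math import comb

n, p, a = 100, 0.5, 75
mean = n * p              # E[X] = 50
var = n * p * (1 - p)     # Var(X) = 25

exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, n + 1))
markov = mean / a                   # P(X >= a) <= E[X] / a
chebyshev = var / (a - mean) ** 2   # P(|X - E[X]| >= a - E[X]) <= Var/(a-E[X])^2

print(f"exact     {exact:.2e}")      # tiny
print(f"markov    {markov:.3f}")     # 0.667 -- very loose
print(f"chebyshev {chebyshev:.3f}")  # 0.040 -- tighter, still far from exact
```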
Classic Applications
Average-case linear search. Searching for a key in an unsorted array of $n$ elements where each position is equally likely:

$$E[\text{comparisons}] = \sum_{i=1}^{n} i \cdot \frac{1}{n} = \frac{n+1}{2}$$
On average, you check about half the array. This is one of the simplest expected value calculations and a good one to internalize.
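The same sum, computed exactly with rational arithmetic as a sanity check:

```python
# Expected comparisons for linear search: position i costs i comparisons,
# each position has probability 1/n, so the sum is (n + 1) / 2.
from fractions import Fraction

def expected_comparisons(n):
    return sum(i * Fraction(1, n) for i in range(1, n + 1))

print(expected_comparisons(10))    # 11/2
print(expected_comparisons(1000))  # 1001/2
```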
Randomized quicksort. Choosing a pivot uniformly at random gives expected $O(n \log n)$ comparisons regardless of input order. The analysis defines indicator variables $X_{ij} = 1$ if the elements of rank $i$ and rank $j$ are ever compared, computes $P(X_{ij} = 1) = \frac{2}{j - i + 1}$, and sums over all pairs. Linearity of expectation does all the heavy lifting.
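Summing those pairwise probabilities directly gives the exact expected comparison count; a small sketch using the standard $P(X_{ij} = 1) = \frac{2}{j - i + 1}$ for ranks $i < j$:

```python
# Exact expected comparison count for randomized quicksort, by summing
# 2/(j - i + 1) over all pairs of ranks i < j. The total is O(n log n),
# roughly 2n ln n.
from math import log

def expected_quicksort_comparisons(n):
    return sum(2 / (j - i + 1)
               for i in range(1, n)
               for j in range(i + 1, n + 1))

n = 1000
print(expected_quicksort_comparisons(n))  # about 10986
print(2 * n * log(n))                     # about 13816, same growth rate
```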
Coupon collector. To collect all $n$ distinct coupons when each draw is uniform random, the expected number of draws is $n H_n = n \sum_{k=1}^{n} \frac{1}{k} \approx n \ln n$. The trick is decomposing into geometric waiting times and summing expectations.
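A simulation sketch comparing the $n H_n$ formula against empirical draws (the choices $n = 50$ and 5000 trials are arbitrary):

```python
# Coupon collector: expected draws to see all n coupons is n * H_n,
# where H_n is the n-th harmonic number.
import random

def draws_to_collect(n):
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        draws += 1
    return draws

n = 50
harmonic = sum(1 / k for k in range(1, n + 1))
expected = n * harmonic  # exact expectation, about 225 for n = 50
empirical = sum(draws_to_collect(n) for _ in range(5000)) / 5000

print(expected)
print(empirical)  # close to the exact value
```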
Birthday paradox. In a group of $n$ people, the probability that at least two share a birthday (out of 365 days) exceeds 50% when $n \geq 23$. The CS version: with $m$ hash slots, expect a collision after roughly $\sqrt{m}$ insertions. This is why hash tables need to be sized generously.
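The exact collision probability is easy to compute by multiplying the "still no collision" factors $(365 - i)/365$; a minimal sketch:

```python
# Exact birthday-collision probability: P(no collision among k people)
# is the product of (365 - i)/365 for i = 0..k-1; its complement first
# exceeds 50% at k = 23.
def collision_prob(k, days=365):
    p_distinct = 1.0
    for i in range(k):
        p_distinct *= (days - i) / days
    return 1 - p_distinct

print(collision_prob(22))  # about 0.476
print(collision_prob(23))  # about 0.507
```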
Related Notes
- Best, Worst & Average Cases - probability underpins average-case complexity analysis
- Combinatorics - counting outcomes is the foundation of computing probabilities
- Linear Algebra Fundamentals - Markov chains connect probability with matrix methods
- Graph Theory - random graphs and probabilistic methods in combinatorics