Why Probability Matters for CS

Without probability, you can only talk about worst-case and best-case performance, and those extremes often misrepresent what actually happens. Average-case analysis gives a much more realistic picture. Beyond analysis, randomized algorithms (quicksort with random pivots, randomized primality testing, sketching algorithms) explicitly use probability to get better expected performance or simpler implementations than their deterministic counterparts. Discrete probability, where outcomes are countable, covers the vast majority of what algorithm analysis needs.

Sample Spaces and Events

A sample space $\Omega$ is the set of all possible outcomes of an experiment. An event is a subset of $\Omega$. A probability function $P$ assigns a value in $[0, 1]$ to each event, with $P(\Omega) = 1$.

Basic rules:

  • Complement: $P(\bar{A}) = 1 - P(A)$
  • Union: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
  • Independence: $A$ and $B$ are independent if $P(A \cap B) = P(A)\,P(B)$

Warning

Independence and mutual exclusivity are not the same thing. Mutually exclusive events (if one happens, the other can’t) are actually dependent unless one of them has probability 0. This trips people up more often than it should.

Conditional Probability

The probability of $A$ given that $B$ has occurred:

$P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}, \quad P(B) > 0$

Conditioning is the mechanism that lets you update beliefs as you learn new information. It’s the foundation of Bayesian reasoning and shows up everywhere from spam filters to medical diagnosis.

Bayes’ Theorem

This is the theorem that makes statistics and machine learning work. It reverses the conditioning direction:

$P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$

With a partition $A_1, A_2, \dots, A_n$ of the sample space, the denominator expands via the law of total probability:

$P(B) = \sum_{i} P(B \mid A_i)\,P(A_i)$

Example

Spam filtering. Let $S$ = “email is spam” and $F$ = “email contains the word ‘free’.” Given estimates of $P(S)$, $P(F \mid S)$, and $P(F \mid \bar{S})$:

$P(S \mid F) = \dfrac{P(F \mid S)\,P(S)}{P(F \mid S)\,P(S) + P(F \mid \bar{S})\,P(\bar{S})}$

An email containing “free” has about a 77% chance of being spam. Naive Bayes classifiers extend this across many features and remain competitive for text classification despite their simplifying assumptions.
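The calculation can be sketched directly. The three input probabilities below are illustrative assumptions (the section states only the roughly-77% result); these particular values happen to yield a posterior near 0.77.

```python
# Bayes' theorem for the spam example, with the law of total probability
# expanding the denominator P(F). Input values are assumed for illustration.

def spam_posterior(p_spam, p_free_given_spam, p_free_given_ham):
    # P(F) = P(F|S)P(S) + P(F|not S)P(not S)
    p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)
    return p_free_given_spam * p_spam / p_free

posterior = spam_posterior(p_spam=0.4, p_free_given_spam=0.5, p_free_given_ham=0.1)
print(f"{posterior:.3f}")  # 0.769
```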

Expected Value

A random variable $X$ assigns a number to each outcome. The expected value (mean):

$E[X] = \sum_{x} x \cdot P(X = x)$

Two properties that get used constantly:

  • Linearity of expectation: $E[X + Y] = E[X] + E[Y]$, always, even when $X$ and $Y$ are dependent. This is the single most useful fact in probabilistic algorithm analysis.
  • Variance: $\mathrm{Var}(X) = E[(X - E[X])^2] = E[X^2] - E[X]^2$, measures spread around the mean.

Tip

Linearity of expectation is shockingly powerful because it requires no independence assumption. You can decompose a complicated random variable into simple indicator variables, compute each expectation trivially, and sum them up. The randomized quicksort analysis and the coupon collector problem both work this way.
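A classic instance of this decomposition (not mentioned above, but standard): the expected number of fixed points of a uniform random permutation. Each indicator $I_i$ ("position $i$ keeps its element") has $E[I_i] = 1/n$, and summing $n$ of them gives exactly 1 for every $n$, even though the indicators are dependent. A quick simulation sketch:

```python
import random

# Expected fixed points of a random permutation: decompose into indicators
# I_i = 1 if position i keeps its element. E[I_i] = 1/n, so the sum of the
# n expectations is exactly 1 -- no independence needed, only linearity.

def avg_fixed_points(n, trials, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        total += sum(1 for i, v in enumerate(perm) if i == v)
    return total / trials

print(avg_fixed_points(n=10, trials=50_000))  # close to 1.0
```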

Common Distributions

Distribution   Setup
Bernoulli      Single trial, success probability $p$
Binomial       $n$ independent trials, each with success probability $p$
Geometric      Trials until first success, success probability $p$
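The standard means of these distributions ($p$, $np$, and $1/p$ respectively — facts not stated in the table, but well known) can be sanity-checked by simulation:

```python
import random

# Simulating each distribution and checking the standard means:
# Bernoulli(p) -> p, Binomial(n, p) -> n*p,
# Geometric(p), counting trials until first success -> 1/p.

rng = random.Random(1)

def bernoulli(p):
    return 1 if rng.random() < p else 0

def binomial(n, p):
    return sum(bernoulli(p) for _ in range(n))

def geometric(p):
    trials = 1
    while bernoulli(p) == 0:
        trials += 1
    return trials

T, p, n = 50_000, 0.3, 10
bern_mean = sum(bernoulli(p) for _ in range(T)) / T
binom_mean = sum(binomial(n, p) for _ in range(T)) / T
geom_mean = sum(geometric(p) for _ in range(T)) / T
print(bern_mean, binom_mean, geom_mean)  # near 0.3, 3.0, 3.33
```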

Concentration Inequalities

Expected value tells you the average, but you often need to know how tightly a random variable clusters around that average.

  • Markov’s inequality: for non-negative $X$ and any $a > 0$, $P(X \ge a) \le E[X]/a$. Crude but universally applicable.
  • Chebyshev’s inequality: $P(|X - E[X]| \ge k) \le \mathrm{Var}(X)/k^2$. Uses variance for a tighter bound.
  • Chernoff bounds: exponentially tight bounds for sums of independent random variables. This is the workhorse for analyzing randomized algorithms and proving “with high probability” guarantees.

Note

The progression Markov → Chebyshev → Chernoff represents increasing tightness at the cost of stronger assumptions. Markov needs non-negativity. Chebyshev needs finite variance. Chernoff needs independence. Knowing when to reach for which one is a recurring theme in algorithm analysis.
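The tightness gap is easy to see numerically. A sketch bounding the tail $P(X \ge 75)$ for $X \sim \mathrm{Binomial}(100, 0.5)$ with Markov and Chebyshev, against the exact tail (which decays exponentially, the regime Chernoff captures); the parameters are chosen for illustration:

```python
import math

# Comparing Markov and Chebyshev bounds on P(X >= 75), X ~ Binomial(100, 0.5),
# against the exact binomial tail.

n, p, a = 100, 0.5, 75
mean = n * p                 # E[X] = 50
var = n * p * (1 - p)        # Var(X) = 25

markov = mean / a                    # P(X >= a) <= E[X]/a
chebyshev = var / (a - mean) ** 2    # P(|X - 50| >= 25) <= Var/25^2

# Exact tail from the binomial pmf
exact = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(a, n + 1))

print(f"Markov    <= {markov:.4f}")    # 0.6667
print(f"Chebyshev <= {chebyshev:.4f}") # 0.0400
print(f"exact tail = {exact:.2e}")     # far smaller still
```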

Classic Applications

Average-case linear search. Searching for a key in an unsorted array of $n$ elements, where each position is equally likely:

$E[\text{comparisons}] = \sum_{i=1}^{n} i \cdot \frac{1}{n} = \frac{n+1}{2}$

On average, you check about half the array. This is one of the simplest expected value calculations and a good one to internalize.
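A sketch that does the expected-value calculation exhaustively rather than by simulation: average the comparison count over every possible key position, which gives exactly $(n+1)/2$.

```python
# Average comparisons of successful linear search over all key positions.

def comparisons(arr, key):
    for i, v in enumerate(arr):
        if v == key:
            return i + 1  # number of elements examined
    return len(arr)

n = 101
arr = list(range(n))
avg = sum(comparisons(arr, k) for k in arr) / n
print(avg)  # (n + 1) / 2 = 51.0
```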

Randomized quicksort. Choosing a pivot uniformly at random gives expected $O(n \log n)$ comparisons regardless of input order. The analysis defines indicator variables $X_{ij} = 1$ if the $i$-th and $j$-th smallest elements are ever compared, computes $P(X_{ij} = 1) = \frac{2}{j - i + 1}$, and sums over all pairs. Linearity of expectation does all the heavy lifting.
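A sketch that counts comparisons in a simple randomized quicksort (distinct keys assumed) and checks the empirical average against the analytic pair sum:

```python
import random

# Randomized quicksort with a comparison counter, checked against the
# analytic expectation: sum over rank pairs i < j of 2/(j - i + 1).

def quicksort_comparisons(arr, rng):
    if len(arr) <= 1:
        return 0
    pivot = arr[rng.randrange(len(arr))]
    less = [x for x in arr if x < pivot]
    greater = [x for x in arr if x > pivot]
    # len(arr) - 1 comparisons against the pivot, plus recursive calls
    return (len(arr) - 1
            + quicksort_comparisons(less, rng)
            + quicksort_comparisons(greater, rng))

n, trials = 200, 200
rng = random.Random(0)
avg = sum(quicksort_comparisons(list(range(n)), rng) for _ in range(trials)) / trials
predicted = sum(2 / (j - i + 1) for i in range(n) for j in range(i + 1, n))
print(avg, predicted)  # both near 1563 for n = 200
```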

Coupon collector. To collect all $n$ distinct coupons when each draw is uniform random, the expected number of draws is $n H_n = n \sum_{i=1}^{n} \frac{1}{i} \approx n \ln n$. The trick is decomposing the total into geometric waiting times — the wait for the $i$-th new coupon is geometric with success probability $\frac{n - i + 1}{n}$ — and summing expectations.
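A simulation sketch checking the empirical average against $n H_n$:

```python
import random

# Coupon collector: simulate draws until all n coupons are seen, and
# compare the average against the analytic expectation n * H_n.

def draws_to_collect(n, rng):
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws

n, trials = 50, 2000
rng = random.Random(0)
expected = n * sum(1 / i for i in range(1, n + 1))  # n * H_n, about 225 for n = 50
avg = sum(draws_to_collect(n, rng) for _ in range(trials)) / trials
print(avg, expected)
```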

Birthday paradox. In a group of $n$ people, the probability that at least two share a birthday (out of 365 days) exceeds 50% when $n \ge 23$. The CS version: with $m$ hash slots, expect a collision after roughly $\sqrt{m}$ insertions. This is why hash tables need to be sized generously.
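The exact probability comes from multiplying the "all birthdays distinct so far" factors and taking the complement — a short sketch confirming that $n = 23$ is where it crosses 50%:

```python
# Exact probability that at least two of n people share a birthday,
# via the complement of the all-distinct product.

def p_collision(n, days=365):
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (days - i) / days
    return 1.0 - p_distinct

print(p_collision(22), p_collision(23))  # about 0.476 and 0.507: 23 crosses 50%
```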