Probability & Statistics for Machine Learning - Complete Cheat Sheet

1. Probability Basics

Core Concepts

Sample Space (Ω): Set of all possible outcomes

Event (A): Subset of sample space

Probability (when all outcomes are equally likely): P(A) = favorable outcomes / total outcomes

Axioms

1. 0 ≤ P(A) ≤ 1

2. P(Ω) = 1

3. P(A ∪ B) = P(A) + P(B) if A and B are disjoint

Conditional Probability

P(A|B) = P(A ∩ B) / P(B)

Read as: "Probability of A given B"

Bayes' Theorem

P(A|B) = [P(B|A) × P(A)] / P(B)

Or more usefully:

P(A|B) = [P(B|A) × P(A)] / [P(B|A)P(A) + P(B|¬A)P(¬A)]

In ML terms

P(hypothesis|data) = [P(data|hypothesis) × P(hypothesis)] / P(data)

Posterior = (Likelihood × Prior) / Evidence

Classic Example: Disease Testing

Given:

• P(Disease) = 0.01 (1% have disease)

• P(Test+|Disease) = 0.99 (99% sensitivity)

• P(Test+|No Disease) = 0.05 (5% false positive)

If test is positive, what's P(Disease)?

P(D|T+) = [0.99 × 0.01] / [0.99×0.01 + 0.05×0.99]

= 0.0099 / 0.0594

= 0.167 = 16.7%

Surprising! Even with a positive test, there is only a 16.7% chance of disease: the disease is rare, so false positives outnumber true positives.
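
The arithmetic above can be checked in a few lines of Python. This is a minimal sketch; the function name `posterior` and its parameter names are illustrative, not taken from any library.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(Disease | Test+) via Bayes' theorem (illustrative helper)."""
    true_pos = sensitivity * prior                  # P(T+|D) * P(D)
    false_pos = false_positive_rate * (1 - prior)   # P(T+|¬D) * P(¬D)
    return true_pos / (true_pos + false_pos)        # divide by the evidence P(T+)

p = posterior(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
print(f"P(Disease | Test+) = {p:.3f}")
```

Raising the prior (for example, testing only symptomatic patients) is what moves the posterior most; try `prior=0.5` to see the effect.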

Independence

Events A and B are independent if:

P(A ∩ B) = P(A) × P(B)

Or equivalently: P(A|B) = P(A)

Chain Rule of Probability

P(A₁, A₂, ..., Aₙ) = P(A₁) × P(A₂|A₁) × P(A₃|A₁,A₂) × ... × P(Aₙ|A₁,...,Aₙ₋₁)

Used in: Autoregressive models, language models

2. Random Variables

Types

Type	Definition	Example
Discrete	Countable outcomes	{1,2,3,4,5,6}
Continuous	Uncountable outcomes	[0, ∞)

Probability Mass Function (PMF) - Discrete

P(X = x) = probability that X takes value x

Properties

• P(X = x) ≥ 0

• Σ P(X = x) = 1 (sum over all x)

Example: Fair die

P(X = k) = 1/6 for k ∈ {1,2,3,4,5,6}

Probability Density Function (PDF) - Continuous

f(x) = probability density at x

Properties

• f(x) ≥ 0

• ∫ f(x)dx = 1 (integral over all x)

• P(a ≤ X ≤ b) = ∫ₐᵇ f(x)dx

• Note: P(X = x) = 0 for continuous variables!

Cumulative Distribution Function (CDF)

F(x) = P(X ≤ x)

For discrete: F(x) = Σ P(X = k) for k ≤ x

For continuous: F(x) = ∫₋∞ˣ f(t)dt

Properties

• F(x) is non-decreasing

• lim F(x) = 0 as x → -∞

• lim F(x) = 1 as x → ∞

• P(a < X ≤ b) = F(b) - F(a)

Expected Value (Mean)

Discrete: E[X] = Σ x·P(X = x)

Continuous: E[X] = ∫ x·f(x)dx

Symbol: μ or E[X]

Properties

E[aX + b] = aE[X] + b (linearity)

E[X + Y] = E[X] + E[Y] (even if not independent!)

E[XY] = E[X]·E[Y] (if independent; more generally, whenever X and Y are uncorrelated)

Variance

Var(X) = E[(X - μ)²] = E[X²] - (E[X])²

Symbol: σ² or Var(X)

Standard Deviation: σ = √Var(X)

Properties

Var(aX + b) = a²Var(X)

Var(X + Y) = Var(X) + Var(Y) (if independent or uncorrelated; otherwise add 2Cov(X,Y))

Example: Fair die X ∈ {1,2,3,4,5,6}

E[X] = (1+2+3+4+5+6)/6 = 3.5

E[X²] = (1+4+9+16+25+36)/6 = 15.17

Var(X) = 15.17 - 3.5² = 2.92
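
The same die calculation can be done exactly with fractions, which avoids the rounding in 15.17 and 2.92. A minimal sketch; the variable names are illustrative.

```python
from fractions import Fraction

# PMF of a fair six-sided die: each face has probability 1/6.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

mean = sum(k * p for k, p in pmf.items())              # E[X]
second_moment = sum(k**2 * p for k, p in pmf.items())  # E[X^2]
variance = second_moment - mean**2                     # E[X^2] - (E[X])^2

print(mean, second_moment, variance)  # 7/2 91/6 35/12
```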

3. Common Distributions

Discrete Distributions

Bernoulli Distribution

Single trial with success/failure

X ∈ {0, 1}

P(X = 1) = p

P(X = 0) = 1 - p

E[X] = p

Var(X) = p(1-p)

Used in: Binary classification, coin flips

Binomial Distribution

Number of successes in n independent Bernoulli trials

X ~ Binomial(n, p)

X ∈ {0, 1, 2, ..., n}

P(X = k) = C(n,k) × pᵏ × (1-p)ⁿ⁻ᵏ

where C(n,k) = n!/(k!(n-k)!)

E[X] = np

Var(X) = np(1-p)

Used in: Number of heads in n coin flips

Poisson Distribution

Number of events in fixed time interval

X ~ Poisson(λ)

X ∈ {0, 1, 2, ...}

P(X = k) = (λᵏ × e^(-λ)) / k!

E[X] = λ

Var(X) = λ

Used in: Website visitors per hour, system failures
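
A quick numerical check of the Poisson facts, using the standard PMF λᵏ·e^(-λ)/k!. This is a sketch: `poisson_pmf` is an illustrative helper, and the infinite sums are truncated at k = 59, which is an assumption that only works because the tail is negligible for λ = 5.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 5.0
ks = range(60)  # truncation of the infinite support (assumption: tail is negligible)

total = sum(poisson_pmf(k, lam) for k in ks)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
var = sum(k**2 * poisson_pmf(k, lam) for k in ks) - mean**2

print(f"sum={total:.6f} mean={mean:.6f} var={var:.6f}")
```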

Categorical Distribution

Generalization of Bernoulli to k outcomes

X ∈ {1, 2, ..., k}

P(X = i) = pᵢ where Σpᵢ = 1

Used in: Multi-class classification (softmax output!)

Continuous Distributions

Uniform Distribution

All values equally likely in [a, b]

X ~ Uniform(a, b)

f(x) = 1/(b-a) for a ≤ x ≤ b

= 0 otherwise

E[X] = (a+b)/2

Var(X) = (b-a)²/12

Used in: Random initialization, sampling

Normal/Gaussian Distribution

X ~ N(μ, σ²)

PDF: f(x) = (1/√(2πσ²)) × exp(-(x-μ)²/(2σ²))

E[X] = μ

Var(X) = σ²

Standard Normal: N(0, 1)

z = (x - μ)/σ (standardization)

Properties

✓ Symmetric around mean

✓ Bell-shaped curve

✓ 68% within 1σ, 95% within 2σ, 99.7% within 3σ

✓ Sum of independent Normals is Normal

✓ Linear transformation stays Normal

Why it's everywhere

• Central Limit Theorem → many things become Normal!

• Weight initialization (Xavier/He)

• Gaussian noise in data

• Maximum entropy distribution for a given mean and variance

• Easy to work with mathematically
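
The 68-95-99.7 rule listed under Properties follows directly from the Normal CDF, which can be written with the error function. A minimal sketch using only the standard library; `normal_cdf` is an illustrative helper name.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2), expressed through the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Probability mass within k standard deviations of the mean.
within = {k: normal_cdf(k) - normal_cdf(-k) for k in (1, 2, 3)}

for k, mass in within.items():
    print(f"within {k} sigma: {mass:.4f}")
```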

Exponential Distribution

Time until event occurs

X ~ Exp(λ)

PDF: f(x) = λe^(-λx) for x ≥ 0

E[X] = 1/λ

Var(X) = 1/λ²

Used in: Time between events, survival analysis

Beta Distribution

Distribution over probabilities [0, 1]

X ~ Beta(α, β)

x ∈ [0, 1]

E[X] = α/(α+β)

Used in: Bayesian priors for probabilities, A/B testing

4. Multivariate Distributions

Joint Probability

P(X = x, Y = y)

probability that X=x AND Y=y

Discrete: P(X=x, Y=y)

Continuous: f(x,y) = joint PDF

Marginal Probability

Get distribution of one variable from joint

Discrete: P(X = x) = Σᵧ P(X=x, Y=y)

Continuous: f(x) = ∫ f(x,y)dy

Conditional Distribution

P(Y=y|X=x) = P(X=x, Y=y) / P(X=x)

f(y|x) = f(x,y) / f(x)

Covariance

Measure of how two variables vary together

Cov(X, Y) = E[(X - E[X])(Y - E[Y])]

= E[XY] - E[X]E[Y]

Properties

• Cov(X, X) = Var(X)

• Cov(X, Y) = Cov(Y, X) (symmetric)

• Cov(X, Y) = 0 if X, Y independent (the converse does not hold in general)

• Positive → tend to increase together

• Negative → one increases, other decreases

Correlation

Normalized covariance (scale-free)

Corr(X, Y) = ρ = Cov(X,Y) / (σₓσᵧ)

Range: -1 ≤ ρ ≤ 1

• ρ = 1 → perfect positive linear relationship

• ρ = 0 → no linear relationship

• ρ = -1 → perfect negative linear relationship

⚠️ Correlation ≠ Causation!
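
A small sketch of covariance and correlation on toy data, including a case where Y depends on X but ρ = 0 because the relationship is not linear. The helper names and the data are illustrative; the population form (divide by n) is used.

```python
import math

def covariance(xs, ys):
    """Population covariance: mean of (x - mean_x)(y - mean_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_up = [2 * x + 1 for x in xs]        # perfect positive linear relationship
ys_down = [-3 * x for x in xs]         # perfect negative linear relationship
ys_curve = [(x - 3) ** 2 for x in xs]  # fully determined by x, but not linearly

print(correlation(xs, ys_up), correlation(xs, ys_down), correlation(xs, ys_curve))
```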

Multivariate Gaussian

⚡ Critical for ML!

X ~ N(μ, Σ)

where:

• μ = mean vector (d dimensions)

• Σ = covariance matrix (d × d)

PDF: f(x) = (1/√((2π)ᵈ|Σ|)) × exp(-½(x-μ)ᵀΣ⁻¹(x-μ))

Properties

• Marginals are Gaussian

• Conditionals are Gaussian

• Linear transformations stay Gaussian

Used in

• Gaussian Mixture Models (GMM)

• Variational Autoencoders (VAE)

• Kalman filters

• Gaussian processes

Covariance Matrix:

Σᵢⱼ = Cov(Xᵢ, Xⱼ)

Diagonal = variances, Off-diagonal = covariances

5. Key Theorems

Law of Large Numbers (LLN)

Sample average converges to expected value as n → ∞

x̄ₙ = (1/n)Σxᵢ → E[X] as n → ∞

Why it matters

• Justifies using sample mean to estimate population mean

• Foundation of Monte Carlo methods

Central Limit Theorem (CLT)

⚡ ONE OF THE MOST IMPORTANT!

Standardized sum of many i.i.d. random variables (finite variance) → Normal distribution

If X₁, X₂, ..., Xₙ are i.i.d. with mean μ and variance σ²:

Sₙ = X₁ + X₂ + ... + Xₙ

Then: (Sₙ - nμ) / (σ√n) → N(0, 1) as n → ∞

Or for sample mean: √n(x̄ - μ)/σ → N(0, 1)

Why it's HUGE

• Explains why Normal distribution appears everywhere!

• Justifies confidence intervals

• Foundation of hypothesis testing

• Works even if original distribution is not Normal!

Example:

Roll a die 1000 times and average → approximately Normal! Even though die roll is uniform, not Normal.
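
The die example can be simulated directly. This is a rough sketch with assumed settings: 100 rolls per average (rather than 1000, to keep it fast), 2000 repetitions, and a fixed seed. If the CLT approximation is good, about 95% of the averages should fall within 1.96 standard errors of 3.5.

```python
import math
import random

rng = random.Random(0)          # fixed seed so the run is repeatable
n_rolls, n_repeats = 100, 2000  # assumed simulation sizes

# Each entry is the average of n_rolls fair-die rolls.
averages = [
    sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls
    for _ in range(n_repeats)
]

mu = 3.5                           # E[X] for a fair die
se = math.sqrt(35 / 12 / n_rolls)  # sigma / sqrt(n), with Var(X) = 35/12

mean_of_averages = sum(averages) / n_repeats
within = sum(abs(a - mu) <= 1.96 * se for a in averages) / n_repeats

print(f"mean of averages: {mean_of_averages:.3f} (theory: {mu})")
print(f"share within 1.96 SE: {within:.3f} (Normal theory: about 0.95)")
```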

Jensen's Inequality

For convex function f:

f(E[X]) ≤ E[f(X)]

For concave function:

f(E[X]) ≥ E[f(X)]

Used in: Information theory, variational inference, optimization bounds
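
Both directions of Jensen's inequality can be checked numerically on a small discrete distribution. A minimal sketch; the four equally likely values are made up for illustration.

```python
import math

xs = [1.0, 2.0, 4.0, 8.0]  # equally likely values (illustrative)
mean = sum(xs) / len(xs)   # E[X]

def expect(f):
    """E[f(X)] under the uniform distribution on xs."""
    return sum(f(x) for x in xs) / len(xs)

def square(x):
    return x * x

convex_lhs, convex_rhs = square(mean), expect(square)        # f(E[X]) vs E[f(X)], f convex
concave_lhs, concave_rhs = math.log(mean), expect(math.log)  # same comparison, f concave

print(f"x^2:  {convex_lhs:.4f} <= {convex_rhs:.4f}")
print(f"log:  {concave_lhs:.4f} >= {concave_rhs:.4f}")
```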

6. Estimation

Maximum Likelihood Estimation (MLE)

⚡ Super important for ML!

Idea: Find parameters that maximize probability of observed data

θ̂_MLE = argmax L(θ|data)

= argmax P(data|θ)

In practice, maximize log-likelihood:

θ̂_MLE = argmax log L(θ|data)

Why log?

• Product becomes sum

• Numerically stable

• Easier to differentiate

Example: Estimate p for coin flips

Data: n flips, k heads

Likelihood: L(p) = C(n,k) × pᵏ × (1-p)ⁿ⁻ᵏ

Log-likelihood: ℓ(p) = log C(n,k) + k log(p) + (n-k)log(1-p)

Take derivative, set to 0:

dℓ/dp = k/p - (n-k)/(1-p) = 0

Solve: p̂_MLE = k/n (sample proportion!)
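
The closed-form result can be confirmed by brute force: evaluate the log-likelihood on a grid of p values and take the argmax. A sketch with assumed data (n = 20 flips, k = 7 heads); `log_likelihood` is an illustrative helper.

```python
import math

def log_likelihood(p, n, k):
    """Binomial log-likelihood: log C(n,k) + k log p + (n-k) log(1-p)."""
    return math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1 - p)

n, k = 20, 7                                # assumed data: 7 heads in 20 flips
grid = [i / 1000 for i in range(1, 1000)]   # candidate values of p in (0, 1)

p_hat = max(grid, key=lambda p: log_likelihood(p, n, k))
print(f"grid argmax = {p_hat}, k/n = {k / n}")
```

A grid search is only practical for one or two parameters; it is used here purely to confirm the calculus.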

MLE for Normal distribution:

Given data x₁, x₂, ..., xₙ ~ N(μ, σ²)

μ̂_MLE = x̄ = (1/n)Σxᵢ (sample mean)

σ̂²_MLE = (1/n)Σ(xᵢ - x̄)² (biased: divides by n; the unbiased sample variance divides by n-1)
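
The MLE variance divides by n, while the usual unbiased sample variance divides by n-1. Python's standard library exposes both, which makes the difference easy to see. A minimal sketch on made-up data.

```python
import statistics

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative data

mu_mle = statistics.fmean(xs)            # MLE of the mean: the sample mean
var_mle = statistics.pvariance(xs)       # divides by n   (MLE, biased)
var_unbiased = statistics.variance(xs)   # divides by n-1 (unbiased)

print(mu_mle, var_mle, var_unbiased)
```

The two variance estimates differ by the factor n/(n-1), so the gap disappears as n grows.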

Maximum A Posteriori (MAP)

MLE + prior information (Bayesian approach)

θ̂_MAP = argmax P(θ|data)

= argmax P(data|θ) × P(θ) [by Bayes]

= argmax [log P(data|θ) + log P(θ)]

MLE: log P(data|θ)

MAP: log P(data|θ) + log P(θ)

↑ likelihood ↑ prior

Connection to Regularization

MAP with Gaussian prior → L2 regularization (Ridge)

MAP with Laplace prior → L1 regularization (Lasso)

Loss = -log P(data|θ) - log P(θ)

= NLL + regularization term

Bias and Variance of Estimators

Bias(θ̂) = E[θ̂] - θ

Unbiased: E[θ̂] = θ

Consistent: θ̂ → θ (in probability) as n → ∞

Variance: Var(θ̂) = how much estimator varies

Mean Squared Error:

MSE(θ̂) = Bias² + Variance
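
The decomposition can be seen in simulation with a deliberately biased estimator. This sketch assumes a hypothetical "shrunk mean" estimator (0.8 times the sample mean) of a Normal mean θ = 2 with σ = 1 and n = 10; under those assumptions theory gives Bias = -0.4, Variance = 0.064, MSE = 0.224.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
theta, sigma, n, trials = 2.0, 1.0, 10, 5000  # assumed simulation settings

def shrunk_mean(sample, shrink=0.8):
    """Hypothetical biased estimator: sample mean pulled toward zero."""
    return shrink * sum(sample) / len(sample)

estimates = [
    shrunk_mean([rng.gauss(theta, sigma) for _ in range(n)])
    for _ in range(trials)
]

avg = sum(estimates) / trials
bias = avg - theta                                         # E[estimate] - theta
variance = sum((e - avg) ** 2 for e in estimates) / trials
mse = sum((e - theta) ** 2 for e in estimates) / trials

print(f"bias^2 + variance = {bias**2 + variance:.4f}, mse = {mse:.4f}")
```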

Confidence Intervals

Range built by a procedure that covers the true parameter with a stated long-run frequency (the confidence level)

95% CI: [θ̂ - 1.96·SE, θ̂ + 1.96·SE]

where SE = standard error (for a sample mean: σ/√n)

Interpretation: If we repeat experiment many times, 95% of intervals will contain true parameter.
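
The repeated-experiment interpretation can be simulated: draw many samples, build a 95% interval from each, and count how often the interval covers the true mean. A sketch under assumed settings (Normal data, known σ = 1, n = 25, 2000 repetitions, fixed seed).

```python
import math
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
mu, sigma, n, trials = 0.0, 1.0, 25, 2000  # assumed settings; sigma treated as known
se = sigma / math.sqrt(n)                  # standard error of the sample mean

covered = 0
for _ in range(trials):
    xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
    low, high = xbar - 1.96 * se, xbar + 1.96 * se
    covered += low <= mu <= high           # True counts as 1

coverage = covered / trials
print(f"coverage = {coverage:.3f} (target: 0.95)")
```

Each individual interval either contains μ or it does not; the 95% describes the procedure, not any single interval.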

7. Hypothesis Testing

Framework

1. Null Hypothesis (H₀): Status quo (no effect)

2. Alternative Hypothesis (H₁): What we want to show

3. Test statistic: Computed from data

4. p-value: P(data at least as extreme as observed | H₀ is true)

5. Decision: Reject H₀ if p-value < α (significance level)

p-value

p-value = probability of observing data (or more extreme) assuming H₀ is true

Common threshold: α = 0.05

If p < 0.05 → Reject H₀ (statistically significant)

If p ≥ 0.05 → Fail to reject H₀

⚠️ Common Misconception:

❌ p-value is NOT probability that H₀ is true

✓ p-value is the probability of data at least as extreme as observed, given H₀ is true

Type I and Type II Errors

	H₀ True	H₀ False
Reject H₀	Type I Error (α)	✓ Correct
Fail to Reject	✓ Correct	Type II Error (β)

Type I Error (False Positive): Reject H₀ when it's true - Probability = α

Type II Error (False Negative): Fail to reject H₀ when it's false - Probability = β

Power = 1 - β: Probability of correctly rejecting false H₀

Common Tests

z-test

For large samples (n > 30) with known σ

z = (x̄ - μ₀) / (σ/√n)

If |z| > 1.96 → reject H₀ at α=0.05
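
A worked z-test using only the standard library. The numbers are hypothetical (sample mean 52, H₀ mean 50, known σ = 10, n = 100), and `z_test` is an illustrative helper; the two-sided p-value comes from the standard Normal CDF.

```python
import math

def z_test(sample_mean, mu0, sigma, n):
    """Two-sided one-sample z-test with known sigma. Returns (z, p_value)."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))  # Phi(|z|)
    return z, 2.0 * (1.0 - phi)

z, p_value = z_test(sample_mean=52.0, mu0=50.0, sigma=10.0, n=100)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Here |z| = 2.0 exceeds 1.96, so H₀ is rejected at α = 0.05, consistent with the p-value falling just under 0.05.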

t-test

For small samples or unknown σ

t = (x̄ - μ₀) / (s/√n)

where s = sample standard deviation

Compare t to t-distribution with (n-1) degrees of freedom

Chi-square test

For categorical data

χ² = Σ (Observed - Expected)² / Expected

Used for: goodness-of-fit, independence testing

A/B Testing

⚡ Super practical for ML!

Goal: Compare two variants (A and B)

Setup:

• Randomly assign users to A or B

• Measure conversion rate (or other metric)

• Test if difference is statistically significant

Test: Two-sample z-test or t-test

H₀: p_A = p_B (no difference)

H₁: p_A ≠ p_B (there is difference)

Calculate:

z = (p̂_A - p̂_B) / SE

where SE = √[p̂(1-p̂)(1/n_A + 1/n_B)]

p̂ = pooled proportion

Important:

• Need sufficient sample size (power analysis)

• Beware of multiple testing

• Consider practical vs statistical significance

8. Information Theory

Entropy (Shannon Entropy)

Measure of uncertainty/information content

H(X) = -Σ P(x) log₂ P(x)

or with natural log:

H(X) = -Σ P(x) ln P(x)

Units: bits (log₂) or nats (ln)

Properties

• H(X) ≥ 0

• H(X) = 0 iff X is deterministic

• Maximum when distribution is uniform

• For uniform over n outcomes: H = log₂(n)

Examples

Fair coin: P(H)=0.5, P(T)=0.5

H = -0.5 log₂(0.5) - 0.5 log₂(0.5) = 1 bit

Certain outcome: P(A)=1

H = -1 log₂(1) = 0 bits (no uncertainty!)

Biased coin: P(H)=0.9, P(T)=0.1

H = -0.9 log₂(0.9) - 0.1 log₂(0.1) = 0.47 bits
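
The three entropy examples above can be reproduced with a short helper. A minimal sketch; `entropy_bits` is an illustrative name, and zero-probability outcomes are skipped (the usual 0·log 0 = 0 convention).

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

fair = entropy_bits([0.5, 0.5])
certain = entropy_bits([1.0])
biased = entropy_bits([0.9, 0.1])

print(f"fair={fair:.2f} certain={certain:.2f} biased={biased:.2f}")
```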

Cross-Entropy

⚡ THE LOSS FUNCTION!

H(p, q) = -Σ p(x) log q(x)

where:

• p = true distribution

• q = predicted distribution

Used in: Classification loss (cross-entropy loss)

Binary Cross-Entropy

BCE = -[y log(ŷ) + (1-y) log(1-ŷ)]

where:

• y = true label (0 or 1)

• ŷ = predicted probability

Categorical Cross-Entropy

CCE = -Σ yᵢ log(ŷᵢ)

where:

• y = one-hot encoded true label

• ŷ = softmax output (predicted probabilities)

Example:

True: [0, 1, 0] (class 2)

Pred: [0.1, 0.7, 0.2]

CCE = -[0×log(0.1) + 1×log(0.7) + 0×log(0.2)]

= -log(0.7) = 0.36
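
The same example in code. A minimal sketch with natural log, matching the 0.36 above; `categorical_cross_entropy` is an illustrative helper, not a library function.

```python
import math

def categorical_cross_entropy(y_true, y_pred):
    """-sum(y_i * log(yhat_i)); only classes with y_i > 0 contribute."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

y_true = [0, 1, 0]        # one-hot label: class 2
y_pred = [0.1, 0.7, 0.2]  # predicted probabilities

loss = categorical_cross_entropy(y_true, y_pred)
confident_wrong = categorical_cross_entropy(y_true, [0.98, 0.01, 0.01])

print(f"loss = {loss:.4f}, confident-but-wrong loss = {confident_wrong:.4f}")
```

The second call shows how sharply the loss grows when the model puts almost no probability on the true class.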

Why Cross-Entropy for Classification?

✓ Derived from MLE for categorical distribution

✓ Penalizes confident wrong predictions heavily

✓ Gradient w.r.t. the logits is simply ŷ - y with sigmoid/softmax outputs, so it stays large when a prediction is confidently wrong

✓ Probabilistic interpretation

KL Divergence (Kullback-Leibler)

Measure of difference between two distributions

KL(p||q) = Σ p(x) log[p(x)/q(x)]

= H(p,q) - H(p)

= Cross-Entropy - Entropy

Properties

• KL(p||q) ≥ 0

• KL(p||q) = 0 iff p = q

• NOT symmetric: KL(p||q) ≠ KL(q||p)

• NOT a distance metric (asymmetric, and the triangle inequality can fail)

Used in: VAE, RL (policy optimization), model comparison
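
A short numerical check of the identity KL = cross-entropy - entropy and of the asymmetry. A sketch on two made-up coin distributions, in nats; the helper names are illustrative.

```python
import math

def entropy(p):
    return sum(pi * math.log(1.0 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(p || q); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # fair coin
q = [0.9, 0.1]  # biased coin

kl_pq, kl_qp = kl(p, q), kl(q, p)
print(f"KL(p||q) = {kl_pq:.4f}, KL(q||p) = {kl_qp:.4f}")
print(f"H(p,q) - H(p) = {cross_entropy(p, q) - entropy(p):.4f}")
```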

Mutual Information

Measure of dependence between variables

I(X;Y) = H(X) + H(Y) - H(X,Y)

= KL(p(x,y) || p(x)p(y))

I(X;Y) = 0 iff X and Y are independent

Used in: Feature selection, information theory in neural networks, causality
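
Mutual information can be computed straight from a joint probability table using the KL form above. A minimal sketch; the 2×2 tables are made up, and `mutual_information_bits` is an illustrative helper.

```python
import math

def mutual_information_bits(joint):
    """I(X;Y) in bits from a joint table joint[i][j] = P(X=i, Y=j)."""
    px = [sum(row) for row in joint]        # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]  # marginal of Y (column sums)
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

dependent = [[0.4, 0.1], [0.1, 0.4]]        # X and Y usually agree
independent = [[0.25, 0.25], [0.25, 0.25]]  # joint = product of marginals

mi_dep = mutual_information_bits(dependent)
mi_ind = mutual_information_bits(independent)
print(f"dependent: {mi_dep:.4f} bits, independent: {mi_ind:.4f} bits")
```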

9. Key ML Concepts

Bias-Variance Tradeoff

⚡ CRITICAL!

Total Error = Bias² + Variance + Irreducible Error

Bias

• Error from wrong assumptions

• High bias → underfitting

• Model too simple

Variance

• Error from sensitivity to data

• High variance → overfitting

• Model too complex

Examples:

• Linear regression on non-linear data → High Bias

• Deep network on small dataset → High Variance

Solutions:

• High Bias: Add features, increase model complexity

• High Variance: More data, regularization, dropout

MLE Connection to Loss Functions

Classification (Bernoulli):

MLE → minimize -log P(y|x)

Loss = Binary Cross-Entropy

Multi-class (Categorical):

MLE → minimize -log P(y|x)

Loss = Categorical Cross-Entropy

Regression (Gaussian noise, fixed variance):

MLE → minimize -log P(y|x)

Loss = Mean Squared Error (MSE)

General pattern:

Loss = -log Likelihood

Regularization as Prior

L2 Regularization = Gaussian Prior

Loss = MSE + λ||w||₂²

= -log P(data|w) - log P(w)

where P(w) ~ N(0, σ²) with σ² ∝ 1/λ (the exact constant depends on the noise variance and how the loss is scaled)

L1 Regularization = Laplace Prior

Loss = MSE + λ||w||₁

= -log P(data|w) - log P(w)

where P(w) ~ Laplace(0, b) with scale b ∝ 1/λ
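
The shrinkage effect of the L2 penalty is easiest to see in one dimension. A sketch assuming a no-intercept model y ≈ w·x and the loss (1/n)Σ(y - wx)² + λw², whose minimizer is w = mean(xy) / (mean(x²) + λ). The data and helper names are illustrative.

```python
def ridge_slope(xs, ys, lam):
    """Minimizer of mean((y - w*x)^2) + lam * w^2 for a single weight w."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) / n
    sxx = sum(x * x for x in xs) / n
    return sxy / (sxx + lam)

def loss(w, xs, ys, lam):
    """MSE plus L2 penalty, i.e. the negative log-posterior up to constants."""
    n = len(xs)
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / n + lam * w * w

xs = [1.0, 2.0, 3.0, 4.0]  # illustrative data, roughly y = 2x
ys = [2.1, 3.9, 6.2, 7.8]

w_mle = ridge_slope(xs, ys, lam=0.0)  # no prior: plain least squares
w_map = ridge_slope(xs, ys, lam=1.0)  # Gaussian prior pulls w toward 0

print(f"w_mle = {w_mle:.4f}, w_map = {w_map:.4f}")
```

A larger λ corresponds to a tighter prior around zero and therefore a smaller weight.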

Probabilistic Interpretation of Neural Networks

Softmax output = Categorical distribution over classes

P(y = k | x) = exp(zₖ) / Σexp(zᵢ)

Binary sigmoid = Bernoulli distribution

P(y = 1 | x) = σ(z) = 1/(1 + e⁻ᶻ)

Regression output = Gaussian mean

P(y | x) = N(f(x), σ²)
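
A sketch of the two output functions in plain Python. Subtracting the maximum logit in softmax and branching on the sign in sigmoid are common numerical-stability tricks, added here as an assumption about good practice rather than anything stated above.

```python
import math

def softmax(logits):
    """Categorical probabilities from logits (max subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Bernoulli probability from a single logit, safe for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs], round(sigmoid(0.0), 3))
```

A two-class softmax over logits [z, 0] gives the same probability as sigmoid(z), which is why the binary case only needs one output.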

Sampling Techniques in ML

Monte Carlo: Average over random samples

Bootstrap: Resample with replacement (estimate uncertainty)

Cross-Validation: Split data for model evaluation

Importance Sampling: Sample from different distribution

MCMC: Sample complex distributions

10. Quick Formulas Reference

Expectation

E[aX + b] = aE[X] + b

E[X + Y] = E[X] + E[Y]

E[XY] = E[X]E[Y] (if independent)

E[X²] = Var(X) + (E[X])²

Variance

Var(aX + b) = a²Var(X)

Var(X + Y) = Var(X) + Var(Y) (if indep)

Var(X) = E[X²] - (E[X])²

Covariance

Cov(X, Y) = E[XY] - E[X]E[Y]

Cov(X, X) = Var(X)

Cov(aX, bY) = ab·Cov(X,Y)

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y)

Standard Distributions

Distribution	Mean	Variance
Bernoulli(p)	p	p(1-p)
Binomial(n,p)	np	np(1-p)
Poisson(λ)	λ	λ
Uniform(a,b)	(a+b)/2	(b-a)²/12
Normal(μ,σ²)	μ	σ²
Exponential(λ)	1/λ	1/λ²

11. Python Implementation

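# Setup (imports used by every snippet below)
import numpy as np
from scipy import stats
from scipy.stats import norm

# Distributions
mu, sigma = 0, 1
x = np.random.normal(mu, sigma, 1000)
binomial = np.random.binomial(n=10, p=0.5, size=1000)
poisson = np.random.poisson(lam=5, size=1000)

# PDF, CDF
pdf = norm.pdf(x, mu, sigma)
cdf = norm.cdf(x, mu, sigma)

# Statistics
y = 0.5 * x + np.random.normal(0, 1, 1000)  # example second variable (illustrative)
mean = np.mean(x)
std = np.std(x, ddof=1)     # ddof=1: divide by n-1 (sample estimate)
var = np.var(x, ddof=1)
cov = np.cov(x, y)          # 2x2 covariance matrix
corr = np.corrcoef(x, y)    # 2x2 correlation matrix

# Hypothesis Testing
group_a = np.random.normal(0.0, 1, 200)  # example groups (illustrative)
group_b = np.random.normal(0.2, 1, 200)
observed = np.array([18, 22, 20])        # example counts; totals must match expected
expected = np.array([20, 20, 20])
t_stat, p_value = stats.ttest_1samp(x, popmean=0)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
chi2, p_value = stats.chisquare(observed, expected)

# Confidence Intervals
se = stats.sem(x)
ci = stats.t.interval(0.95, len(x)-1, loc=mean, scale=se)

# Entropy & Cross-Entropy
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # treat 0*log(0) as 0
    return -np.sum(p * np.log2(p))   # bits

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))    # nats; assumes q > 0 everywhere

def bce(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) +
                    (1 - y_true) * np.log(1 - y_pred))

# A/B Testing
def ab_test(conv_a, total_a, conv_b, total_b):
    p_a = conv_a / total_a
    p_b = conv_b / total_b
    p_pool = (conv_a + conv_b) / (total_a + total_b)   # pooled proportion under H0
    se = np.sqrt(p_pool * (1-p_pool) * (1/total_a + 1/total_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))         # two-sided
    return z, p_value

# Bootstrap
def bootstrap(data, n_iter=1000, func=np.mean):
    results = []
    for _ in range(n_iter):
        sample = np.random.choice(data, size=len(data), replace=True)
        results.append(func(sample))
    return np.array(results)   # e.g. np.percentile(results, [2.5, 97.5]) for a 95% CI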

12. Summary - Must-Know Concepts

Top 10 Concepts (Memorize These!)

1. Bayes' Theorem: P(A|B) = P(B|A)P(A)/P(B)

2. Expected Value: E[X] = Σx·P(x)

3. Variance: Var(X) = E[X²] - (E[X])²

4. Normal Distribution: N(μ, σ²), 68-95-99.7 rule

5. MLE: argmax log P(data|θ)

6. Cross-Entropy: -Σy log(ŷ)

7. Central Limit Theorem: Sum → Normal

8. Bias-Variance: Total Error = Bias² + Variance + ε

9. p-value: P(data at least this extreme | H₀ true)

10. Entropy: H(X) = -Σp log(p)

Key Connections to ML

Bayes → Bayesian ML, priors, posteriors

MLE → Loss functions (cross-entropy, MSE)

MAP → Regularization (L1, L2)

Gaussian → Weight init, noise modeling

Bernoulli/Categorical → Classification

Cross-Entropy → Classification loss

KL Divergence → VAE, RL

Bias-Variance → Model selection

Hypothesis Testing → A/B testing

Sampling → Monte Carlo, bootstrap

Practice Problems

1. If P(A) = 0.3, P(B|A) = 0.7, what is P(A ∩ B)?

2. X ~ N(10, 4). What is P(X < 12)?

3. Calculate entropy of a fair 6-sided die

4. Given y = [0,1,0], ŷ = [0.2, 0.6, 0.2], find cross-entropy

5. Var(3X + 5) = ? if Var(X) = 4

Answers

1. P(A ∩ B) = P(B|A)×P(A) = 0.7×0.3 = 0.21

2. z = (12-10)/2 = 1, P(Z<1) ≈ 0.8413

3. H = -Σ(1/6)log₂(1/6) = log₂(6) ≈ 2.58 bits

4. CE = -[0×log(0.2) + 1×log(0.6) + 0×log(0.2)] = -log(0.6) ≈ 0.51

5. Var(3X + 5) = 3²×Var(X) = 9×4 = 36

Pro Tips

For ML/DL

Focus on Bayes, MLE, Cross-Entropy, and distributions - they're the foundation of modern ML.

For Interviews

Be able to explain CLT, bias-variance tradeoff, and why we use cross-entropy for classification.

Practice

Work through probability problems daily. Understand the intuition, not just formulas.