Probability & Statistics for Machine Learning

Reference covering distributions, Bayes' theorem, MLE, hypothesis testing, and information theory - the core probability and statistics used in ML/DL.

1. Probability Basics

Core Concepts

Sample Space (Ω): Set of all possible outcomes
Event (A): Subset of sample space
Probability: P(A) = favorable outcomes / total outcomes (valid when all outcomes are equally likely)

Axioms

1. 0 ≤ P(A) ≤ 1
2. P(Ω) = 1
3. P(A ∪ B) = P(A) + P(B) if A and B are disjoint

Conditional Probability

P(A|B) = P(A ∩ B) / P(B)

Read as: "Probability of A given B"

Bayes' Theorem

⚡ THE MOST IMPORTANT!

P(A|B) = [P(B|A) × P(A)] / P(B)
Or more usefully:
P(A|B) = [P(B|A) × P(A)] / [P(B|A)P(A) + P(B|¬A)P(¬A)]

In ML terms

P(hypothesis|data) = [P(data|hypothesis) × P(hypothesis)] / P(data)
Posterior = (Likelihood × Prior) / Evidence

Classic Example: Disease Testing

Given:
• P(Disease) = 0.01 (1% have disease)
• P(Test+|Disease) = 0.99 (99% sensitivity)
• P(Test+|No Disease) = 0.05 (5% false positive)
If test is positive, what's P(Disease)?
P(D|T+) = [0.99 × 0.01] / [0.99×0.01 + 0.05×0.99]
= 0.0099 / 0.0594
= 0.167 = 16.7%
Surprising! Even with positive test, only 16.7% chance!
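
The calculation above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `posterior` and its parameter names are chosen here for illustration.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(Disease | Test+) via Bayes' theorem and the law of total probability."""
    true_pos = sensitivity * prior                    # P(T+|D) * P(D)
    false_pos = false_positive_rate * (1.0 - prior)   # P(T+|not D) * P(not D)
    return true_pos / (true_pos + false_pos)

p = posterior(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
print(f"P(Disease | Test+) = {p:.3f}")  # 0.167
```

The low prior drives the result: with the same test and a 10% prior, the posterior rises to about 69%.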

Independence

Events A and B are independent if:

P(A ∩ B) = P(A) × P(B)
Or equivalently: P(A|B) = P(A)

Chain Rule of Probability

P(A₁, A₂, ..., Aₙ) = P(A₁) × P(A₂|A₁) × P(A₃|A₁,A₂) × ... × P(Aₙ|A₁,...,Aₙ₋₁)

Used in: Autoregressive models, language models
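
A small sketch of the chain rule in action. The card example is standard; the conditional probabilities in the second half are made-up numbers standing in for a language model's outputs.

```python
import math
from fractions import Fraction

# Chain rule with two events: P(A1, A2) = P(A1) * P(A2 | A1)
p_first_ace = Fraction(4, 52)                # 4 aces in a full deck
p_second_ace_given_first = Fraction(3, 51)   # one ace already drawn
p_two_aces = p_first_ace * p_second_ace_given_first
print(p_two_aces)  # 1/221

# Autoregressive view: log P(sequence) is a sum of conditional log-probabilities.
# Hypothetical values for P(w1), P(w2|w1), P(w3|w1,w2):
cond_probs = [0.5, 0.4, 0.9]
log_p_sequence = sum(math.log(p) for p in cond_probs)
p_sequence = math.exp(log_p_sequence)
```

Summing logs instead of multiplying probabilities avoids underflow for long sequences.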

2. Random Variables

Types

Type       | Definition           | Example
Discrete   | Countable outcomes   | {1,2,3,4,5,6}
Continuous | Uncountable outcomes | [0, ∞)

Probability Mass Function (PMF) - Discrete

P(X = x) = probability that X takes value x

Properties

• P(X = x) ≥ 0
• Σ P(X = x) = 1 (sum over all x)

Example: Fair die

P(X = k) = 1/6 for k ∈ {1,2,3,4,5,6}

Probability Density Function (PDF) - Continuous

f(x) = probability density at x (a density, not a probability; f(x) can exceed 1)

Properties

• f(x) ≥ 0
• ∫ f(x)dx = 1 (integral over all x)
• P(a ≤ X ≤ b) = ∫ₐᵇ f(x)dx
• Note: P(X = x) = 0 for continuous variables!

Cumulative Distribution Function (CDF)

F(x) = P(X ≤ x)
For discrete: F(x) = Σ P(X = k) for k ≤ x
For continuous: F(x) = ∫₋∞ˣ f(t)dt

Properties

• F(x) is non-decreasing
• lim F(x) = 0 as x → -∞
• lim F(x) = 1 as x → ∞
• P(a < X ≤ b) = F(b) - F(a)

Expected Value (Mean)

Discrete: E[X] = Σ x·P(X = x)
Continuous: E[X] = ∫ x·f(x)dx
Symbol: μ or E[X]

Properties

E[aX + b] = aE[X] + b (linearity)
E[X + Y] = E[X] + E[Y] (even if not independent!)
E[XY] = E[X]·E[Y] (if independent; more generally, exactly when X and Y are uncorrelated)

Variance

Var(X) = E[(X - μ)²] = E[X²] - (E[X])²
Symbol: σ² or Var(X)
Standard Deviation: σ = √Var(X)

Properties

Var(aX + b) = a²Var(X)
Var(X + Y) = Var(X) + Var(Y) (if independent or, more generally, uncorrelated; otherwise add 2Cov(X,Y))

Example: Fair die X ∈ {1,2,3,4,5,6}

E[X] = (1+2+3+4+5+6)/6 = 3.5
E[X²] = (1+4+9+16+25+36)/6 = 15.17
Var(X) = 15.17 - 3.5² = 2.92
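
The same numbers can be checked exactly with Python's `fractions` module. This is a minimal sketch that applies the definitions of E[X] and Var(X) to the die PMF.

```python
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]
pmf = {k: Fraction(1, 6) for k in faces}   # fair die: P(X = k) = 1/6

mean = sum(k * p for k, p in pmf.items())               # E[X]
second_moment = sum(k**2 * p for k, p in pmf.items())   # E[X^2]
variance = second_moment - mean**2                      # E[X^2] - (E[X])^2

print(mean, second_moment, variance)  # 7/2 91/6 35/12
```

35/12 ≈ 2.92, matching the rounded value above.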

3. Common Distributions

Discrete Distributions

Bernoulli Distribution

Single trial with success/failure

X ∈ {0, 1}
P(X = 1) = p
P(X = 0) = 1 - p
E[X] = p
Var(X) = p(1-p)

Used in: Binary classification, coin flips

Binomial Distribution

Number of successes in n independent Bernoulli trials

X ~ Binomial(n, p)
X ∈ {0, 1, 2, ..., n}
P(X = k) = C(n,k) × pᵏ × (1-p)ⁿ⁻ᵏ
where C(n,k) = n!/(k!(n-k)!)
E[X] = np
Var(X) = np(1-p)

Used in: Number of heads in n coin flips

Poisson Distribution

Number of events in fixed time interval

X ~ Poisson(λ)
X ∈ {0, 1, 2, ...}
P(X = k) = (λᵏ × e^(-λ)) / k!
E[X] = λ
Var(X) = λ

Used in: Website visitors per hour, system failures
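
A quick numerical check of the Poisson PMF λᵏ e^(-λ) / k!, confirming that it sums to 1 and that mean and variance both equal λ. The truncation at k = 100 is an assumption that is safe for λ = 5, where the remaining tail is negligible.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 5.0
ks = range(0, 100)   # truncated support; the tail beyond 100 is negligible here
probs = [poisson_pmf(k, lam) for k in ks]

total = sum(probs)
mean = sum(k * p for k, p in zip(ks, probs))
var = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))
print(f"sum={total:.6f} mean={mean:.6f} var={var:.6f}")
```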

Categorical Distribution

Generalization of Bernoulli to k outcomes

X ∈ {1, 2, ..., k}
P(X = i) = pᵢ where Σpᵢ = 1

Used in: Multi-class classification (softmax output!)

Continuous Distributions

Uniform Distribution

All values equally likely in [a, b]

X ~ Uniform(a, b)
f(x) = 1/(b-a) for a ≤ x ≤ b
= 0 otherwise
E[X] = (a+b)/2
Var(X) = (b-a)²/12

Used in: Random initialization, sampling

Normal/Gaussian Distribution

⚡ MOST IMPORTANT!

X ~ N(μ, σ²)
PDF: f(x) = (1/√(2πσ²)) × exp(-(x-μ)²/(2σ²))
E[X] = μ
Var(X) = σ²
Standard Normal: N(0, 1)
z = (x - μ)/σ (standardization)
Properties
✓ Symmetric around mean
✓ Bell-shaped curve
✓ 68% within 1σ, 95% within 2σ, 99.7% within 3σ
✓ Sum of independent Normals is Normal
✓ Linear transformation stays Normal
Why it's everywhere
• Central Limit Theorem → many things become Normal!
• Weight initialization (Xavier/He)
• Gaussian noise in data
• Maximum entropy distribution for a given mean and variance
• Easy to work with mathematically
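
The 68-95-99.7 rule follows directly from the Normal CDF, which can be written with the error function. This is a sketch using only the standard library; the helper names are illustrative.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_within(k, mu=0.0, sigma=1.0):
    """P(mu - k*sigma <= X <= mu + k*sigma)."""
    return normal_cdf(mu + k * sigma, mu, sigma) - normal_cdf(mu - k * sigma, mu, sigma)

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))  # 0.6827, 0.9545, 0.9973
```

The result does not depend on μ or σ, because standardization z = (x - μ)/σ maps every Normal to N(0, 1).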

Exponential Distribution

Time until event occurs

X ~ Exp(λ)
PDF: f(x) = λe^(-λx) for x ≥ 0
E[X] = 1/λ
Var(X) = 1/λ²

Used in: Time between events, survival analysis

Beta Distribution

Distribution over probabilities [0, 1]

X ~ Beta(α, β)
x ∈ [0, 1]
E[X] = α/(α+β)

Used in: Bayesian priors for probabilities, A/B testing

4. Multivariate Distributions

Joint Probability

P(X = x, Y = y)
probability that X=x AND Y=y
Discrete: P(X=x, Y=y)
Continuous: f(x,y) = joint PDF

Marginal Probability

Get distribution of one variable from joint

Discrete: P(X = x) = Σᵧ P(X=x, Y=y)
Continuous: f(x) = ∫ f(x,y)dy

Conditional Distribution

P(Y=y|X=x) = P(X=x, Y=y) / P(X=x)
f(y|x) = f(x,y) / f(x)

Covariance

Measure of how two variables vary together

Cov(X, Y) = E[(X - E[X])(Y - E[Y])]
= E[XY] - E[X]E[Y]

Properties

• Cov(X, X) = Var(X)
• Cov(X, Y) = Cov(Y, X) (symmetric)
• Cov(X, Y) = 0 if X, Y independent (the converse does not hold in general)
• Positive → tend to increase together
• Negative → one increases, other decreases

Correlation

Normalized covariance (scale-free)

Corr(X, Y) = ρ = Cov(X,Y) / (σₓσᵧ)
Range: -1 ≤ ρ ≤ 1
• ρ = 1 → perfect positive linear relationship
• ρ = 0 → no linear relationship
• ρ = -1 → perfect negative linear relationship
⚠️ Correlation ≠ Causation!
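
A short sketch of covariance and correlation computed from their definitions. The data are made up to show that ρ = 0 only rules out a linear relationship: the second y is a deterministic function of x, yet its correlation with x is zero.

```python
import math

def covariance(xs, ys):
    """Population covariance E[(X - E[X])(Y - E[Y])] over paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Pearson correlation: Cov(X, Y) / (sigma_x * sigma_y)."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_linear = [2 * x + 1 for x in xs]           # exact linear relationship
ys_quadratic = [(x - 3) ** 2 for x in xs]     # fully dependent on x, but not linearly

rho_linear = correlation(xs, ys_linear)        # 1 up to rounding
rho_quadratic = correlation(xs, ys_quadratic)  # 0: no linear relationship
```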

Multivariate Gaussian

⚡ Critical for ML!

X ~ N(μ, Σ)
where:
• μ = mean vector (d dimensions)
• Σ = covariance matrix (d × d)
PDF: f(x) = (1/√((2π)ᵈ|Σ|)) × exp(-½(x-μ)ᵀΣ⁻¹(x-μ))

Properties

• Marginals are Gaussian
• Conditionals are Gaussian
• Linear transformations stay Gaussian

Used in

• Gaussian Mixture Models (GMM)
• Variational Autoencoders (VAE)
• Kalman filters
• Gaussian processes

Covariance Matrix:

Σᵢⱼ = Cov(Xᵢ, Xⱼ)

Diagonal = variances, Off-diagonal = covariances

5. Key Theorems

Law of Large Numbers (LLN)

Sample average converges to expected value as n → ∞
x̄ₙ = (1/n)Σxᵢ → E[X] as n → ∞

Why it matters

• Justifies using sample mean to estimate population mean
• Foundation of Monte Carlo methods

Central Limit Theorem (CLT)

⚡ ONE OF THE MOST IMPORTANT!

Standardized sum of many i.i.d. random variables with finite variance → Normal distribution

If X₁, X₂, ..., Xₙ are i.i.d. with mean μ and variance σ²:
Sₙ = X₁ + X₂ + ... + Xₙ
Then: (Sₙ - nμ) / (σ√n) → N(0, 1) as n → ∞
Or for sample mean: √n(x̄ - μ)/σ → N(0, 1)

Why it's HUGE

• Explains why Normal distribution appears everywhere!
• Justifies confidence intervals
• Foundation of hypothesis testing
• Works even if original distribution is not Normal!

Example:

Roll a die 1000 times and average → approximately Normal! Even though die roll is uniform, not Normal.
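
The die example can be simulated directly. This sketch uses 200 rolls per average instead of 1000 to keep the run fast; the seed and repeat count are arbitrary choices. If the CLT holds, the standardized averages should fall within 1 and 2 standard errors roughly 68% and 95% of the time.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

N_ROLLS = 200      # rolls averaged per experiment
N_REPEATS = 2000   # number of independent experiments

def die_average(n_rolls):
    """Average of n_rolls fair six-sided die rolls."""
    return sum(random.randint(1, 6) for _ in range(n_rolls)) / n_rolls

mu = 3.5
sigma = (35 / 12) ** 0.5        # standard deviation of a single roll
se = sigma / N_ROLLS ** 0.5     # standard deviation of the average: sigma / sqrt(n)

# Standardize each average; by the CLT these should behave like N(0, 1) draws.
z_scores = [(die_average(N_ROLLS) - mu) / se for _ in range(N_REPEATS)]

within_1 = sum(abs(z) <= 1 for z in z_scores) / N_REPEATS
within_2 = sum(abs(z) <= 2 for z in z_scores) / N_REPEATS
print(f"within 1 SE: {within_1:.3f}, within 2 SE: {within_2:.3f}")
```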

Jensen's Inequality

For convex function f:
f(E[X]) ≤ E[f(X)]
For concave function:
f(E[X]) ≥ E[f(X)]

Used in: Information theory, variational inference, optimization bounds
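
A numerical check of both directions of Jensen's inequality on a fair die, using the convex function x² and the concave function log. The choice of functions is illustrative.

```python
import math

values = [1, 2, 3, 4, 5, 6]          # fair die outcomes, each with probability 1/6
mean = sum(values) / len(values)     # E[X] = 3.5

def expect(f):
    """E[f(X)] for X uniform over `values`."""
    return sum(f(v) for v in values) / len(values)

# Convex f(x) = x^2: f(E[X]) <= E[f(X)]
convex_lhs, convex_rhs = mean ** 2, expect(lambda v: v ** 2)    # 12.25 vs about 15.17

# Concave f(x) = log(x): f(E[X]) >= E[f(X)]
concave_lhs, concave_rhs = math.log(mean), expect(math.log)     # about 1.25 vs about 1.10
```

The gap in the convex case, E[X²] - (E[X])², is exactly Var(X).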

6. Estimation

Maximum Likelihood Estimation (MLE)

⚡ Super important for ML!

Idea: Find parameters that maximize probability of observed data

θ̂_MLE = argmax L(θ|data)
= argmax P(data|θ)
In practice, maximize log-likelihood:
θ̂_MLE = argmax log L(θ|data)

Why log?

• Product becomes sum
• Numerically stable
• Easier to differentiate

Example: Estimate p for coin flips

Data: n flips, k heads
Likelihood: L(p) = C(n,k) × pᵏ × (1-p)ⁿ⁻ᵏ
Log-likelihood: ℓ(p) = log C(n,k) + k log(p) + (n-k)log(1-p)
Take derivative, set to 0:
dℓ/dp = k/p - (n-k)/(1-p) = 0
Solve: p̂_MLE = k/n (sample proportion!)
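
The closed-form result can be confirmed by maximizing the log-likelihood numerically. This is a minimal sketch with hypothetical data (n = 100 flips, k = 37 heads) and a simple grid search standing in for a real optimizer.

```python
import math

def log_likelihood(p, n, k):
    """Binomial log-likelihood in p; the constant log C(n, k) is dropped."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

n, k = 100, 37   # hypothetical data: 37 heads in 100 flips

# Evaluate on a grid over (0, 1) and keep the maximizer.
grid = [i / 1000 for i in range(1, 1000)]
p_hat_grid = max(grid, key=lambda p: log_likelihood(p, n, k))
p_hat_closed_form = k / n

print(p_hat_grid, p_hat_closed_form)  # 0.37 0.37
```

Dropping log C(n,k) is safe because it does not depend on p, so it does not move the argmax.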

MLE for Normal distribution:

Given data x₁, x₂, ..., xₙ ~ N(μ, σ²)
μ̂_MLE = x̄ = (1/n)Σxᵢ (sample mean)
σ̂²_MLE = (1/n)Σ(xᵢ - x̄)² (biased: divides by n; the unbiased sample variance divides by n-1)

Maximum A Posteriori (MAP)

MLE + prior information (Bayesian approach)

θ̂_MAP = argmax P(θ|data)
= argmax P(data|θ) × P(θ) [by Bayes]
= argmax [log P(data|θ) + log P(θ)]
MLE maximizes: log P(data|θ) (likelihood)
MAP maximizes: log P(data|θ) + log P(θ) (likelihood + prior)

Connection to Regularization

MAP with Gaussian prior → L2 regularization (Ridge)
MAP with Laplace prior → L1 regularization (Lasso)
Loss = -log P(data|θ) - log P(θ)
= NLL + regularization term

Bias and Variance of Estimators

Bias(θ̂) = E[θ̂] - θ
Unbiased: E[θ̂] = θ
Consistent: θ̂ → θ (in probability) as n → ∞
Variance: Var(θ̂) = how much estimator varies
Mean Squared Error:
MSE(θ̂) = Bias² + Variance
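
Estimator bias can be seen in simulation. This sketch compares the variance estimator that divides by n with the one that divides by n-1, on simulated N(0, 2²) data. Sample size, trial count, and seed are arbitrary choices; theory predicts E[σ̂²] = (n-1)/n · σ² for the divide-by-n version.

```python
import random

random.seed(1)  # fixed seed for reproducibility

TRUE_VAR = 4.0   # variance of the simulated N(0, 2^2) data
N = 5            # small sample size makes the bias visible
TRIALS = 20000

sum_mle = 0.0
sum_unbiased = 0.0
for _ in range(TRIALS):
    xs = [random.gauss(0.0, 2.0) for _ in range(N)]
    xbar = sum(xs) / N
    ss = sum((x - xbar) ** 2 for x in xs)   # sum of squared deviations
    sum_mle += ss / N                       # divide by n
    sum_unbiased += ss / (N - 1)            # divide by n-1

avg_mle = sum_mle / TRIALS              # theory: (n-1)/n * 4.0 = 3.2
avg_unbiased = sum_unbiased / TRIALS    # theory: 4.0
bias_mle = avg_mle - TRUE_VAR           # theory: -sigma^2 / n = -0.8
print(f"divide by n: {avg_mle:.2f}, divide by n-1: {avg_unbiased:.2f}")
```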

Confidence Intervals

Range of plausible parameter values, built by a procedure that captures the true parameter at a stated rate (e.g. 95%)

95% CI: [θ̂ - 1.96·SE, θ̂ + 1.96·SE]
where SE = standard error (σ/√n for a sample mean)

Interpretation: If we repeat experiment many times, 95% of intervals will contain true parameter.
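
The repeated-experiment interpretation can be checked by simulation. This sketch assumes Normal data with known σ, so that SE = σ/√n applies exactly. The parameter values and seed are arbitrary.

```python
import random

random.seed(2)  # fixed seed for reproducibility

MU, SIGMA = 10.0, 2.0    # true parameters (known only because we simulate)
N, TRIALS = 50, 2000

se = SIGMA / N ** 0.5    # standard error of the sample mean
covered = 0
for _ in range(TRIALS):
    xs = [random.gauss(MU, SIGMA) for _ in range(N)]
    xbar = sum(xs) / N
    lower, upper = xbar - 1.96 * se, xbar + 1.96 * se   # 95% CI for the mean
    if lower <= MU <= upper:
        covered += 1

coverage = covered / TRIALS
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should land close to 0.95. Any single interval either contains μ or it does not; the 95% describes the procedure.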

7. Hypothesis Testing

Framework

1. Null Hypothesis (H₀): Status quo (no effect)
2. Alternative Hypothesis (H₁): What we want to show
3. Test statistic: Computed from data
4. p-value: P(data at least as extreme as observed | H₀ is true)
5. Decision: Reject H₀ if p-value < α (significance level)

p-value

p-value = probability of observing data (or more extreme) assuming H₀ is true
Common threshold: α = 0.05
If p < 0.05 → Reject H₀ (statistically significant)
If p ≥ 0.05 → Fail to reject H₀

⚠️ Common Misconception:

❌ p-value is NOT probability that H₀ is true
✓ p-value is the probability of data at least as extreme as observed, given H₀ is true

Type I and Type II Errors

               | H₀ True          | H₀ False
Reject H₀      | Type I Error (α) | ✓ Correct
Fail to Reject | ✓ Correct        | Type II Error (β)
Type I Error (False Positive): Reject H₀ when it's true - Probability = α
Type II Error (False Negative): Fail to reject H₀ when it's false - Probability = β
Power = 1 - β: Probability of correctly rejecting false H₀

Common Tests

z-test

For large samples (n > 30) with known σ

z = (x̄ - μ₀) / (σ/√n)
If |z| > 1.96 → reject H₀ at α=0.05
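
A minimal sketch of the one-sample z-test with a two-sided p-value, using only the standard library. The sample numbers are hypothetical. The p-value uses the identity 2·(1 - Φ(|z|)) = erfc(|z|/√2).

```python
import math

def z_test(sample_mean, mu0, sigma, n):
    """One-sample z-test with known sigma; returns (z, two-sided p-value)."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical data: n = 100, sample mean 52, H0: mu = 50, known sigma = 10
z, p = z_test(sample_mean=52.0, mu0=50.0, sigma=10.0, n=100)
print(f"z = {z:.2f}, p = {p:.4f}")  # z = 2.00, p = 0.0455
```

Here |z| = 2.00 > 1.96, so H₀ is rejected at α = 0.05.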

t-test

For small samples or unknown σ

t = (x̄ - μ₀) / (s/√n)
where s = sample standard deviation

Compare t to t-distribution with (n-1) degrees of freedom

Chi-square test

For categorical data

χ² = Σ (Observed - Expected)² / Expected

Used for: goodness-of-fit, independence testing
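
A sketch of the goodness-of-fit statistic for a die, with made-up counts from 60 rolls. Only the statistic is computed here; the decision compares it with the χ² critical value for k-1 degrees of freedom, which is about 11.07 for 5 degrees of freedom at α = 0.05.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [8, 12, 9, 11, 6, 14]   # hypothetical counts from 60 rolls
expected = [10] * 6                # fair die: 60 / 6 per face

stat = chi_square_statistic(observed, expected)
df = len(observed) - 1             # degrees of freedom for goodness-of-fit
print(stat, df)                    # 4.2 5
```

Since 4.2 < 11.07, these counts give no evidence against a fair die.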

A/B Testing

⚡ Super practical for ML!

Goal: Compare two variants (A and B)

Setup:

• Randomly assign users to A or B
• Measure conversion rate (or other metric)
• Test if difference is statistically significant
Test: Two-sample z-test or t-test
H₀: p_A = p_B (no difference)
H₁: p_A ≠ p_B (there is difference)
Calculate:
z = (p̂_A - p̂_B) / SE
where SE = √[p̂(1-p̂)(1/n_A + 1/n_B)]
p̂ = pooled proportion

Important:

• Need sufficient sample size (power analysis)
• Beware of multiple testing
• Consider practical vs statistical significance

8. Information Theory

Entropy (Shannon Entropy)

Measure of uncertainty/information content

H(X) = -Σ P(x) log₂ P(x)
or with natural log:
H(X) = -Σ P(x) ln P(x)
Units: bits (log₂) or nats (ln)

Properties

• H(X) ≥ 0
• H(X) = 0 iff X is deterministic
• Maximum when distribution is uniform
• For uniform over n outcomes: H = log₂(n)

Examples

Fair coin: P(H)=0.5, P(T)=0.5
H = -0.5 log₂(0.5) - 0.5 log₂(0.5) = 1 bit
Certain outcome: P(A)=1
H = -1 log₂(1) = 0 bits (no uncertainty!)
Biased coin: P(H)=0.9, P(T)=0.1
H = -0.9 log₂(0.9) - 0.1 log₂(0.1) = 0.47 bits

Cross-Entropy

⚡ THE LOSS FUNCTION!

H(p, q) = -Σ p(x) log q(x)
where:
• p = true distribution
• q = predicted distribution

Used in: Classification loss (cross-entropy loss)

Binary Cross-Entropy

BCE = -[y log(ŷ) + (1-y) log(1-ŷ)]
where:
• y = true label (0 or 1)
• ŷ = predicted probability

Categorical Cross-Entropy

CCE = -Σ yᵢ log(ŷᵢ)
where:
• y = one-hot encoded true label
• ŷ = softmax output (predicted probabilities)

Example:

True: [0, 1, 0] (class 2)
Pred: [0.1, 0.7, 0.2]
CCE = -[0×log(0.1) + 1×log(0.7) + 0×log(0.2)]
= -log(0.7) ≈ 0.36 (natural log)

Why Cross-Entropy for Classification?

✓ Derived from MLE for categorical distribution
✓ Penalizes confident wrong predictions heavily
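✓ Well-behaved gradient: with sigmoid/softmax outputs, the gradient w.r.t. the logits is ŷ - y, so it does not vanish when a prediction is confidently wrong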
✓ Probabilistic interpretation

KL Divergence (Kullback-Leibler)

Measure of difference between two distributions

KL(p||q) = Σ p(x) log[p(x)/q(x)]
= H(p,q) - H(p)
= Cross-Entropy - Entropy

Properties

• KL(p||q) ≥ 0
• KL(p||q) = 0 iff p = q
• NOT symmetric: KL(p||q) ≠ KL(q||p)
• NOT a distance metric

Used in: VAE, RL (policy optimization), model comparison
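
The identity KL = cross-entropy - entropy and the asymmetry of KL can both be checked numerically. This is a minimal sketch in nats on a made-up pair of distributions; it assumes q(x) > 0 wherever p(x) > 0.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # fair coin
q = [0.9, 0.1]   # biased coin

kl_pq = kl_divergence(p, q)   # about 0.511 nats
kl_qp = kl_divergence(q, p)   # about 0.368 nats: not symmetric
identity_gap = kl_pq - (cross_entropy(p, q) - entropy(p))   # 0 up to rounding
```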

Mutual Information

Measure of dependence between variables

I(X;Y) = H(X) + H(Y) - H(X,Y)
= KL(p(x,y) || p(x)p(y))
I(X;Y) = 0 iff X and Y are independent

Used in: Feature selection, information theory in neural networks, causality
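
Mutual information can be computed from a joint probability table using the KL form above. This sketch uses two made-up 2×2 tables: one where X and Y are independent and one where Y is a copy of X.

```python
import math

def mutual_information(joint):
    """I(X;Y) in nats, where joint[i][j] = P(X = i, Y = j)."""
    px = [sum(row) for row in joint]         # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]   # marginal of Y (column sums)
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]   # P(x, y) = P(x) P(y)
copied = [[0.5, 0.0], [0.0, 0.5]]            # Y always equals X

mi_independent = mutual_information(independent)   # 0
mi_copied = mutual_information(copied)             # log 2, i.e. H(X) for a fair bit
```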

9. Key ML Concepts

Bias-Variance Tradeoff

⚡ CRITICAL!

Total Error = Bias² + Variance + Irreducible Error

Bias

• Error from wrong assumptions
• High bias → underfitting
• Model too simple

Variance

• Error from sensitivity to data
• High variance → overfitting
• Model too complex

Examples:

• Linear regression on non-linear data → High Bias
• Deep network on small dataset → High Variance

Solutions:

• High Bias: Add features, increase model complexity
• High Variance: More data, regularization, dropout

MLE Connection to Loss Functions

Classification (Bernoulli):
MLE → minimize -log P(y|x)
Loss = Binary Cross-Entropy
Multi-class (Categorical):
MLE → minimize -log P(y|x)
Loss = Categorical Cross-Entropy
Regression (Gaussian):
MLE → minimize -log P(y|x)
Loss = Mean Squared Error (MSE), assuming fixed noise variance σ²
General pattern:
Loss = -log Likelihood

Regularization as Prior

L2 Regularization = Gaussian Prior
Loss = MSE + λ||w||₂²
= -log P(data|w) - log P(w)
where w ~ N(0, σ²) with σ² ∝ 1/λ (the exact constant depends on how the data term is scaled)
L1 Regularization = Laplace Prior
Loss = MSE + λ||w||₁
= -log P(data|w) - log P(w)
where w ~ Laplace(0, b) with scale b ∝ 1/λ (the exact constant depends on how the data term is scaled)

Probabilistic Interpretation of Neural Networks

Softmax output = Categorical distribution over classes
P(y = k | x) = exp(zₖ) / Σexp(zᵢ)
Binary sigmoid = Bernoulli distribution
P(y = 1 | x) = σ(z) = 1/(1 + e⁻ᶻ)
Regression output = Gaussian mean
P(y | x) = N(f(x), σ²)
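
A sketch of softmax and sigmoid in plain Python. Subtracting the maximum logit and branching on the sign of z are common numerical-stability tricks added here; they do not change the mathematical result.

```python
import math

def softmax(logits):
    """Categorical probabilities from logits; shifted by max(logits) for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Bernoulli probability from a single logit, avoiding overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

probs = softmax([2.0, 1.0, 0.1])   # sums to 1; the largest logit gets the largest probability
```

Sigmoid is the two-class special case: `softmax([z, 0.0])[0]` equals `sigmoid(z)`.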

Sampling Techniques in ML

Monte Carlo: Average over random samples
Bootstrap: Resample with replacement (estimate uncertainty)
Cross-Validation: Split data for model evaluation
Importance Sampling: Sample from different distribution
MCMC: Sample complex distributions

10. Quick Formulas Reference

Expectation

E[aX + b] = aE[X] + b
E[X + Y] = E[X] + E[Y]
E[XY] = E[X]E[Y] (if independent)
E[X²] = Var(X) + (E[X])²

Variance

Var(aX + b) = a²Var(X)
Var(X + Y) = Var(X) + Var(Y) (if indep)
Var(X) = E[X²] - (E[X])²

Covariance

Cov(X, Y) = E[XY] - E[X]E[Y]
Cov(X, X) = Var(X)
Cov(aX, bY) = ab·Cov(X,Y)
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y)

Standard Distributions

Distribution   | Mean    | Variance
Bernoulli(p)   | p       | p(1-p)
Binomial(n,p)  | np      | np(1-p)
Poisson(λ)     | λ       | λ
Uniform(a,b)   | (a+b)/2 | (b-a)²/12
Normal(μ,σ²)   | μ       | σ²
Exponential(λ) | 1/λ     | 1/λ²

11. Python Implementation

import numpy as np
from scipy import stats
from scipy.stats import norm

# Distributions
mu, sigma = 0, 1
x = np.random.normal(mu, sigma, 1000)
y = 0.5 * x + np.random.normal(mu, sigma, 1000)  # illustrative second variable for cov/corr
binomial = np.random.binomial(n=10, p=0.5, size=1000)
poisson = np.random.poisson(lam=5, size=1000)

# PDF, CDF
pdf = norm.pdf(x, mu, sigma)
cdf = norm.cdf(x, mu, sigma)

# Statistics
mean = np.mean(x)
std = np.std(x, ddof=1)   # ddof=1: divide by n-1 (sample estimate)
var = np.var(x, ddof=1)
cov = np.cov(x, y)        # 2x2 covariance matrix
corr = np.corrcoef(x, y)  # 2x2 correlation matrix

# Hypothesis Testing (illustrative sample data)
group_a = np.random.normal(0.0, 1.0, 200)
group_b = np.random.normal(0.2, 1.0, 200)
observed = np.array([18, 22, 20, 25, 15])
expected = np.full(5, observed.sum() / 5)  # totals must match observed
t_stat, p_value = stats.ttest_1samp(x, popmean=0)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
chi2, p_value = stats.chisquare(observed, expected)

# Confidence Intervals
se = stats.sem(x)
ci = stats.t.interval(0.95, len(x)-1, loc=mean, scale=se)

# Entropy & Cross-Entropy
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))    # bits

def cross_entropy(p, q, eps=1e-12):
    q = np.clip(q, eps, 1.0)          # avoid log(0)
    return -np.sum(p * np.log(q))     # nats

def bce(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) +
                    (1 - y_true) * np.log(1 - y_pred))

# A/B Testing
def ab_test(conv_a, total_a, conv_b, total_b):
    p_a = conv_a / total_a
    p_b = conv_b / total_b
    p_pool = (conv_a + conv_b) / (total_a + total_b)  # pooled proportion under H0
    se = np.sqrt(p_pool * (1-p_pool) * (1/total_a + 1/total_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # two-sided
    return z, p_value

# Bootstrap
def bootstrap(data, n_iter=1000, func=np.mean):
    results = []
    for _ in range(n_iter):
        sample = np.random.choice(data, size=len(data), replace=True)
        results.append(func(sample))
    return np.array(results)

12. Summary - Must-Know Concepts

Top 10 Concepts (Memorize These!)

1. Bayes' Theorem: P(A|B) = P(B|A)P(A)/P(B)
2. Expected Value: E[X] = Σx·P(x)
3. Variance: Var(X) = E[X²] - (E[X])²
4. Normal Distribution: N(μ, σ²), 68-95-99.7 rule
5. MLE: argmax log P(data|θ)
6. Cross-Entropy: -Σy log(ŷ)
7. Central Limit Theorem: Sum → Normal
8. Bias-Variance: Total Error = Bias² + Variance + ε
9. p-value: P(data at least this extreme | H₀ true)
10. Entropy: H(X) = -Σp log(p)

Key Connections to ML

Bayes → Bayesian ML, priors, posteriors
MLE → Loss functions (cross-entropy, MSE)
MAP → Regularization (L1, L2)
Gaussian → Weight init, noise modeling
Bernoulli/Categorical → Classification
Cross-Entropy → Classification loss
KL Divergence → VAE, RL
Bias-Variance → Model selection
Hypothesis Testing → A/B testing
Sampling → Monte Carlo, bootstrap

Practice Problems

1. If P(A) = 0.3, P(B|A) = 0.7, what is P(A ∩ B)?
2. X ~ N(10, 4). What is P(X < 12)?
3. Calculate entropy of a fair 6-sided die
4. Given y = [0,1,0], ŷ = [0.2, 0.6, 0.2], find cross-entropy
5. Var(3X + 5) = ? if Var(X) = 4
Answers
1. P(A ∩ B) = P(B|A)×P(A) = 0.7×0.3 = 0.21
2. z = (12-10)/2 = 1, P(Z<1) ≈ 0.8413
3. H = -Σ(1/6)log₂(1/6) = log₂(6) ≈ 2.58 bits
4. CE = -[0×log(0.2) + 1×log(0.6) + 0×log(0.2)] = -log(0.6) ≈ 0.51
5. Var(3X + 5) = 3²×Var(X) = 9×4 = 36

Pro Tips

For ML/DL

Focus on Bayes, MLE, Cross-Entropy, and distributions - they're the foundation of modern ML.

For Interviews

Be able to explain CLT, bias-variance tradeoff, and why we use cross-entropy for classification.

Practice

Work through probability problems daily. Understand the intuition, not just formulas.