1. Probability Basics
Core Concepts
P(A) = favorable outcomes / total outcomes (valid when all outcomes are equally likely)
Axioms (Kolmogorov)
- P(A) ≥ 0 for every event A
- P(Ω) = 1 (some outcome always occurs)
- P(A ∪ B) = P(A) + P(B) for mutually exclusive events A and B
Conditional Probability
P(A|B) = P(A ∩ B) / P(B), defined for P(B) > 0
Read as: "Probability of A given B"
Bayes' Theorem
⚡ THE MOST IMPORTANT!
P(A|B) = P(B|A) · P(A) / P(B)
In ML terms
P(θ|data) = P(data|θ) · P(θ) / P(data), i.e. posterior ∝ likelihood × prior
Classic Example: Disease Testing
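With illustrative numbers (1% prevalence, 99% sensitivity, 95% specificity; assumed here, not from the original), a positive test still leaves only about a 17% chance of disease, because false positives from the large healthy population dominate. A minimal sketch:

```python
# Bayes' theorem for disease testing.
# Assumed illustrative numbers: 1% prevalence, 99% sensitivity, 95% specificity.
p_disease = 0.01                 # P(D): prior probability of disease
p_pos_given_disease = 0.99       # P(+|D): sensitivity
p_pos_given_healthy = 0.05       # P(+|not D): false positive rate (1 - specificity)

# Total probability of a positive test: P(+) = P(+|D)P(D) + P(+|not D)P(not D)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior: P(D|+) = P(+|D)P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # ≈ 0.167
```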
Independence
Events A and B are independent if:
P(A ∩ B) = P(A) · P(B), equivalently P(A|B) = P(A)
Chain Rule of Probability
P(x₁, …, xₙ) = P(x₁) · P(x₂|x₁) · P(x₃|x₁,x₂) ⋯ P(xₙ|x₁,…,xₙ₋₁)
Used in: Autoregressive models, language models (each token is predicted given all previous tokens), as sketched below
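A minimal sketch of the chain rule on a 3-variable sequence, with made-up conditional probabilities; it also shows the log-space form that language models use in practice to avoid underflow:

```python
import math

# Chain rule: P(x1, x2, x3) = P(x1) * P(x2 | x1) * P(x3 | x1, x2)
# Made-up conditional probabilities for a 3-token sequence.
p_x1 = 0.5
p_x2_given_x1 = 0.4
p_x3_given_x1_x2 = 0.9

p_joint = p_x1 * p_x2_given_x1 * p_x3_given_x1_x2

# Long products of probabilities underflow, so work in log space:
log_p = math.log(p_x1) + math.log(p_x2_given_x1) + math.log(p_x3_given_x1_x2)
print(p_joint, math.exp(log_p))  # both ≈ 0.18
```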
2. Random Variables
Types
| Type | Definition | Example |
|---|---|---|
| Discrete | Countable outcomes | {1,2,3,4,5,6} |
| Continuous | Uncountable outcomes | [0, ∞) |
Probability Mass Function (PMF) - Discrete
p(x) = P(X = x)
Properties
- p(x) ≥ 0 for all x
- Σₓ p(x) = 1
Example: Fair die
P(X = k) = 1/6 for k ∈ {1,2,3,4,5,6}
Probability Density Function (PDF) - Continuous
P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
Properties
- f(x) ≥ 0 (a density may exceed 1; it is not a probability)
- ∫ f(x) dx = 1 over the whole support
- P(X = x) = 0 for any single point
Cumulative Distribution Function (CDF)
F(x) = P(X ≤ x)
Properties
- Non-decreasing, with F(-∞) = 0 and F(∞) = 1
- P(a < X ≤ b) = F(b) - F(a)
- Works for both discrete and continuous variables
Expected Value (Mean)
E[X] = Σₓ x·p(x) (discrete), E[X] = ∫ x·f(x) dx (continuous)
Properties
- Linearity: E[aX + bY] = a·E[X] + b·E[Y], even when X and Y are dependent
- E[c] = c for any constant c
Variance
Var(X) = E[(X - μ)²] = E[X²] - (E[X])²
Properties
- Var(aX + b) = a²·Var(X)
- Var(X + Y) = Var(X) + Var(Y) for independent X, Y
- Standard deviation σ = √Var(X)
Example: Fair die X ∈ {1,2,3,4,5,6}
E[X] = 3.5, E[X²] = 91/6, so Var(X) = 91/6 - 3.5² = 35/12 ≈ 2.92
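A quick check of these numbers, computing E[X] and Var(X) directly from the PMF:

```python
# Expected value and variance of a fair die, computed from the PMF.
outcomes = [1, 2, 3, 4, 5, 6]
pmf = {k: 1 / 6 for k in outcomes}

mean = sum(k * p for k, p in pmf.items())              # E[X] = 3.5
second_moment = sum(k**2 * p for k, p in pmf.items())  # E[X²] = 91/6 ≈ 15.167
variance = second_moment - mean**2                     # Var(X) = 35/12 ≈ 2.917
print(mean, variance)
```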
3. Common Distributions
Discrete Distributions
Bernoulli Distribution
Single trial with success/failure:
P(X = 1) = p, P(X = 0) = 1 - p
Used in: Binary classification, coin flips
Binomial Distribution
Number of successes in n independent Bernoulli trials:
P(X = k) = C(n, k) · pᵏ · (1-p)ⁿ⁻ᵏ
Used in: Number of heads in n coin flips
Poisson Distribution
Number of events in a fixed time interval at average rate λ:
P(X = k) = λᵏ e^(-λ) / k!
Used in: Website visitors per hour, system failures
Categorical Distribution
Generalization of Bernoulli to k outcomes:
P(X = i) = pᵢ, with Σᵢ pᵢ = 1
Used in: Multi-class classification (softmax output!)
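A minimal sketch evaluating the PMFs above with scipy.stats (all parameter values are illustrative):

```python
import numpy as np
from scipy import stats

# PMFs of the discrete distributions above (parameter values are illustrative).
print(stats.bernoulli.pmf(1, p=0.3))     # P(X=1) = 0.3
print(stats.binom.pmf(3, n=10, p=0.5))   # P(3 heads in 10 flips) ≈ 0.117
print(stats.poisson.pmf(2, mu=4))        # P(2 events at rate λ=4) ≈ 0.147

# Categorical: sample one of k outcomes with given probabilities.
rng = np.random.default_rng(0)
print(rng.choice(3, size=5, p=[0.2, 0.5, 0.3]))
```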
Continuous Distributions
Uniform Distribution
All values equally likely in [a, b]:
f(x) = 1/(b - a) for x ∈ [a, b]
Used in: Random initialization, sampling
Normal/Gaussian Distribution
⚡ MOST IMPORTANT!
f(x) = (1 / (σ√(2π))) · exp(-(x - μ)² / (2σ²))
Properties
- Symmetric around μ; mean = median = mode
- 68-95-99.7 rule: about 68% of mass within 1σ, 95% within 2σ, 99.7% within 3σ
- Sums of independent Gaussians are Gaussian
Why it's everywhere
The Central Limit Theorem (Section 5): sums and averages of many independent effects are approximately Normal.
Exponential Distribution
Time until an event occurs, at rate λ:
f(x) = λ e^(-λx) for x ≥ 0; memoryless: P(X > s + t | X > s) = P(X > t)
Used in: Time between events, survival analysis
Beta Distribution
Distribution over probabilities [0, 1]:
f(x) ∝ x^(α-1) · (1-x)^(β-1)
Used in: Bayesian priors for probabilities, A/B testing
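The continuous counterparts, again with illustrative parameters; note that scipy parameterizes the Exponential by scale = 1/λ:

```python
from scipy import stats

# Densities and CDF values for the continuous distributions above.
print(stats.uniform.pdf(0.5, loc=0, scale=1))  # Uniform(0,1) density = 1
print(stats.norm.pdf(0, loc=0, scale=1))       # standard Normal density ≈ 0.399
print(stats.norm.cdf(1.96))                    # ≈ 0.975 (basis of 95% intervals)
print(stats.expon.pdf(1.0, scale=1 / 2))       # Exponential(λ=2); scipy uses scale = 1/λ
print(stats.beta.pdf(0.5, a=2, b=2))           # Beta(2,2) density at 0.5 = 1.5
```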
4. Multivariate Distributions
Joint Probability
P(X = x, Y = y): the probability that X = x and Y = y occur together.
Marginal Probability
Get distribution of one variable from joint by summing (or integrating) out the others:
P(X = x) = Σᵧ P(X = x, Y = y)
Conditional Distribution
P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)
Covariance
Measure of how two variables vary together:
Cov(X, Y) = E[(X - μ_X)(Y - μ_Y)] = E[XY] - E[X]·E[Y]
Properties
- Cov(X, X) = Var(X)
- Independence ⇒ Cov = 0 (the converse does not hold)
Correlation
Normalized covariance (scale-free):
ρ = Cov(X, Y) / (σ_X σ_Y), always in [-1, 1]
Multivariate Gaussian
⚡ Critical for ML!
f(x) = (2π)^(-d/2) |Σ|^(-1/2) exp(-½ (x - μ)ᵀ Σ⁻¹ (x - μ))
Properties
- Fully specified by the mean vector μ and covariance matrix Σ
- Marginals and conditionals of a Gaussian are again Gaussian
- Contours of constant density are ellipsoids shaped by Σ
Used in
Gaussian mixture models, Gaussian processes, Kalman filters
Covariance Matrix:
Σᵢⱼ = Cov(Xᵢ, Xⱼ)
Diagonal = variances, off-diagonal = covariances; Σ is symmetric and positive semi-definite.
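A minimal sketch, with an illustrative mean and covariance (not from the original), that samples a 2-D Gaussian and recovers Σ empirically:

```python
import numpy as np

# Sample from a 2-D Gaussian and recover its covariance matrix empirically.
mean = np.array([0.0, 1.0])
cov = np.array([[1.0, 0.8],
                [0.8, 2.0]])  # diagonal = variances, off-diagonal = covariance

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean, cov, size=100_000)

print(np.cov(samples, rowvar=False))       # ≈ cov
print(np.corrcoef(samples, rowvar=False))  # ρ ≈ 0.8 / √(1·2) ≈ 0.57
```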
5. Key Theorems
Law of Large Numbers (LLN)
The sample mean converges to the true mean: (1/n) Σᵢ Xᵢ → μ as n → ∞.
Why it matters: it justifies approximating expectations with averages over data, e.g. empirical risk and Monte Carlo estimates.
Central Limit Theorem (CLT)
⚡ ONE OF THE MOST IMPORTANT!
Sum of many independent random variables → Normal distribution:
(X̄ₙ - μ) / (σ/√n) → N(0, 1) as n → ∞
Why it's HUGE: it holds regardless of the original distribution (finite variance is enough), which is why Normal distributions appear everywhere and why z-tests and confidence intervals work.
Example:
Roll a die 1000 times and average → the average is approximately Normal, even though a single die roll is uniform, not Normal.
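A simulation of exactly this example; the 10,000-experiment count is an arbitrary choice for the demo:

```python
import numpy as np

# CLT demo: averages of 1000 uniform die rolls are approximately Normal.
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=(10_000, 1000))  # 10,000 experiments of 1000 rolls
averages = rolls.mean(axis=1)

# Theory: mean ≈ 3.5, std ≈ sqrt(35/12) / sqrt(1000) ≈ 0.054
print(averages.mean(), averages.std())
```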
Jensen's Inequality
For a convex function f: f(E[X]) ≤ E[f(X)]; the inequality reverses for concave f (e.g. log).
Used in: Information theory, variational inference (deriving the ELBO), optimization bounds
6. Estimation
Maximum Likelihood Estimation (MLE)
⚡ Super important for ML!
Idea: Find parameters that maximize probability of observed data:
θ̂ = argmax_θ ∏ᵢ p(xᵢ|θ) = argmax_θ Σᵢ log p(xᵢ|θ)
Why log? Products become sums (easier to differentiate, numerically stable), and log is monotonic, so the argmax is unchanged.
Example: Estimate p for coin flips
With h heads in n flips, p̂ = h/n.
MLE for Normal distribution:
μ̂ = (1/n) Σ xᵢ, σ̂² = (1/n) Σ (xᵢ - μ̂)² (note 1/n, not 1/(n-1); the variance MLE is biased)
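A minimal sketch of both MLEs on synthetic data (the coin flips and the Normal parameters are made up for illustration):

```python
import numpy as np

# MLE for a coin: p_hat = (number of heads) / n.
flips = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])  # hypothetical data
p_hat = flips.mean()
print(p_hat)  # 0.7

# MLE for a Normal: mean and (biased) variance, dividing by n rather than n-1.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)
mu_hat = x.mean()
sigma2_hat = np.mean((x - mu_hat) ** 2)  # same as x.var(ddof=0)
print(mu_hat, sigma2_hat)                # ≈ 5.0, ≈ 4.0
```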
Maximum A Posteriori (MAP)
MLE + prior information (Bayesian approach):
θ̂ = argmax_θ [log p(D|θ) + log p(θ)]
Connection to Regularization
A Gaussian prior on the weights gives L2 regularization (ridge); a Laplace prior gives L1 (lasso).
Bias and Variance of Estimators
Bias(θ̂) = E[θ̂] - θ (zero for an unbiased estimator); variance measures how much θ̂ fluctuates from dataset to dataset.
Confidence Intervals
A range, computed from the data, built to cover the true parameter with a chosen frequency; e.g. for a mean with known σ, the 95% interval is x̄ ± 1.96·σ/√n.
Interpretation: If we repeat the experiment many times, 95% of the intervals constructed this way will contain the true parameter (not: "the parameter is in this particular interval with probability 0.95").
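A sketch of a 95% interval on synthetic data (true mean 10, chosen for illustration); since σ is treated as unknown, it uses the t-distribution via scipy:

```python
import numpy as np
from scipy import stats

# 95% confidence interval for a mean with unknown σ (t-interval).
rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=3.0, size=50)

mean = x.mean()
sem = stats.sem(x)  # standard error of the mean: s/√n
ci_low, ci_high = stats.t.interval(0.95, df=len(x) - 1, loc=mean, scale=sem)
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```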
7. Hypothesis Testing
Framework
- Null hypothesis H₀: the default claim (e.g. "no effect") vs. alternative H₁
- Compute a test statistic from the data
- Reject H₀ if the result would be sufficiently unlikely under H₀ (p-value below a significance level α, commonly 0.05)
p-value
The probability of observing data at least as extreme as what was actually observed, assuming H₀ is true.
⚠️ Common Misconception: the p-value is NOT the probability that H₀ is true, and failing to reject H₀ does not prove H₀.
Type I and Type II Errors
| | H₀ True | H₀ False |
|---|---|---|
| Reject H₀ | Type I Error (α) | ✓ Correct |
| Fail to Reject | ✓ Correct | Type II Error (β) |
Common Tests
z-test
For large samples (n > 30) with known σ:
z = (x̄ - μ₀) / (σ/√n), compared against the standard Normal.
t-test
For small samples or unknown σ:
t = (x̄ - μ₀) / (s/√n)
Compare t to the t-distribution with (n-1) degrees of freedom.
Chi-square test
For categorical data:
χ² = Σ (Observed - Expected)² / Expected
Used for: goodness-of-fit, independence testing
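Minimal sketches of a two-sample t-test and a chi-square goodness-of-fit test on made-up data:

```python
import numpy as np
from scipy import stats

# Two-sample t-test on hypothetical data: do two groups share the same mean?
rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=40)
group_b = rng.normal(loc=5.5, scale=1.0, size=40)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # reject H₀ at α=0.05 if p < 0.05

# Chi-square goodness-of-fit: is a die fair, given hypothetical counts?
observed = np.array([18, 22, 16, 25, 20, 19])
chi2, p = stats.chisquare(observed)  # expected counts default to uniform
print(f"chi² = {chi2:.2f}, p = {p:.4f}")
```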
A/B Testing
⚡ Super practical for ML!
Goal: Compare two variants (A and B), e.g. two model versions or page designs
Setup:
- Randomly assign users to A or B
- Measure a metric (e.g. conversion rate) in each group
- Test whether the difference is statistically significant (e.g. a two-proportion z-test, as sketched below)
Important:
- Fix the sample size / test duration in advance; repeatedly peeking at results inflates the false-positive rate
- Statistical significance is not the same as practical significance
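A sketch of the two-proportion z-test for an A/B comparison, with hypothetical conversion counts:

```python
import numpy as np
from scipy import stats

# Two-proportion z-test for an A/B test (hypothetical counts).
conversions_a, n_a = 200, 5000  # variant A: 4.0% conversion
conversions_b, n_b = 250, 5000  # variant B: 5.0% conversion

p_a, p_b = conversions_a / n_a, conversions_b / n_b
p_pool = (conversions_a + conversions_b) / (n_a + n_b)

# Standard error under H₀ (equal rates), then two-sided p-value.
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"z = {z:.2f}, p = {p_value:.4f}")  # ≈ z = 2.41, p = 0.016
```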
8. Information Theory
Entropy (Shannon Entropy)
Measure of uncertainty/information content:
H(X) = -Σₓ p(x) log p(x) (log base 2 → bits; natural log → nats)
Properties
- H(X) ≥ 0
- Maximized by the uniform distribution
- H(X) = 0 for a deterministic outcome
Examples
- Fair coin: H = 1 bit
- Biased coin (p = 0.9): H ≈ 0.47 bits
- Certain outcome: H = 0 bits
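A small entropy helper reproducing these examples (base-2 logs, so the units are bits):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits: H(X) = -Σ p(x) log₂ p(x)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # by convention, 0·log 0 = 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))  # fair coin: 1 bit
print(entropy([0.9, 0.1]))  # biased coin: ≈ 0.469 bits
print(entropy([1.0, 0.0]))  # certain outcome: 0 bits
```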
Cross-Entropy
⚡ THE LOSS FUNCTION!
H(p, q) = -Σₓ p(x) log q(x): the expected coding cost when the data come from p but we model them with q.
Used in: Classification loss (cross-entropy loss)
Binary Cross-Entropy
BCE = -(y log ŷ + (1-y) log(1-ŷ))
Categorical Cross-Entropy
CCE = -Σₖ yₖ log ŷₖ (with one-hot labels this reduces to -log ŷ of the true class)
Example:
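A minimal numeric sketch with hypothetical predictions:

```python
import numpy as np

eps = 1e-12  # guard against log(0)

# Binary cross-entropy: -(y log ŷ + (1-y) log(1-ŷ))
y_true, y_pred = 1, 0.9
bce = -(y_true * np.log(y_pred + eps) + (1 - y_true) * np.log(1 - y_pred + eps))
print(bce)  # -log(0.9) ≈ 0.105

# Categorical cross-entropy: -Σ yₖ log ŷₖ with a one-hot target
y_true_onehot = np.array([0, 1, 0])
y_pred_probs = np.array([0.2, 0.7, 0.1])  # e.g. a softmax output
cce = -np.sum(y_true_onehot * np.log(y_pred_probs + eps))
print(cce)  # -log(0.7) ≈ 0.357
```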
Why Cross-Entropy for Classification?
Minimizing cross-entropy with one-hot labels is exactly maximum likelihood for a categorical output, and paired with softmax it yields well-scaled gradients, unlike squared error on probabilities.
KL Divergence (Kullback-Leibler)
Measure of difference between two distributions:
D_KL(p ‖ q) = Σₓ p(x) log(p(x)/q(x))
Properties
- D_KL(p ‖ q) ≥ 0, with equality iff p = q
- Not symmetric: D_KL(p ‖ q) ≠ D_KL(q ‖ p), so it is not a distance
- H(p, q) = H(p) + D_KL(p ‖ q): minimizing cross-entropy minimizes KL when p is fixed
Used in: VAE, RL (policy optimization), model comparison
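A small helper that also demonstrates the asymmetry:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = Σ p(x) log(p(x)/q(x)), in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # ≈ 0.511
print(kl_divergence(q, p))  # ≈ 0.368, not symmetric!
```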
Mutual Information
Measure of dependence between variables:
I(X; Y) = Σ p(x, y) log[ p(x, y) / (p(x) p(y)) ] = H(X) - H(X|Y)
I(X; Y) = 0 if and only if X and Y are independent.
Used in: Feature selection, information theory in neural networks, causality
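A sketch computing I(X; Y) from a hypothetical 2×2 joint table:

```python
import numpy as np

# Mutual information from a hypothetical joint distribution table p(x, y).
p_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])
p_x = p_xy.sum(axis=1, keepdims=True)  # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)  # marginal p(y)

# I(X;Y) = Σ p(x,y) log( p(x,y) / (p(x)p(y)) ), in nats
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(mi)  # ≈ 0.178 > 0, because X and Y are dependent
```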
9. Key ML Concepts
Bias-Variance Tradeoff
⚡ CRITICAL!
Expected error = Bias² + Variance + Irreducible noise
Bias
Error from overly simple assumptions; the model underfits.
Variance
Error from sensitivity to the particular training set; the model overfits.
Examples:
- High bias: linear regression on clearly nonlinear data
- High variance: an unpruned decision tree that memorizes noise
Solutions:
- High bias → more expressive model, better features
- High variance → regularization, more data, ensembles, early stopping
MLE Connection to Loss Functions
- MLE under Gaussian noise ⇔ minimizing mean squared error (MSE)
- MLE for a Bernoulli/Categorical output ⇔ minimizing binary/categorical cross-entropy
Regularization as Prior
MAP estimation turns priors into penalties: a Gaussian prior on weights gives L2 (ridge), a Laplace prior gives L1 (lasso); see Section 6.
Probabilistic Interpretation of Neural Networks
A softmax classifier outputs a categorical distribution P(y | x); training with cross-entropy is MLE on that distribution.
Sampling Techniques in ML
Common examples: sampling tokens from a softmax (often with a temperature), Monte Carlo estimates of expectations, bootstrap resampling; a sketch of temperature sampling follows below.
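A sketch of temperature sampling from hypothetical softmax logits:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.1])  # hypothetical class/token scores

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits / temperature)  # lower T → sharper distribution
    sample = rng.choice(len(probs), p=probs)
    print(temperature, np.round(probs, 3), sample)
```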
10. Quick Formulas Reference
Expectation
E[X] = Σ x·p(x) (discrete), ∫ x·f(x) dx (continuous); E[aX + b] = a·E[X] + b
Variance
Var(X) = E[X²] - (E[X])²; Var(aX + b) = a²·Var(X)
Covariance
Cov(X, Y) = E[XY] - E[X]·E[Y]
Standard Distributions
| Distribution | Mean | Variance |
|---|---|---|
| Bernoulli(p) | p | p(1-p) |
| Binomial(n,p) | np | np(1-p) |
| Poisson(λ) | λ | λ |
| Uniform(a,b) | (a+b)/2 | (b-a)²/12 |
| Normal(μ,σ²) | μ | σ² |
| Exponential(λ) | 1/λ | 1/λ² |
11. Python Implementation
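A minimal consolidated sketch: sampling-based sanity checks of the mean/variance table from Section 10 (sample size and seed are arbitrary choices):

```python
import numpy as np

# Check the theoretical means/variances in Section 10 by simulation.
rng = np.random.default_rng(0)
n = 200_000

checks = {
    "Bernoulli(0.3)":  (rng.binomial(1, 0.3, n),   0.3, 0.3 * 0.7),
    "Binomial(10,.5)": (rng.binomial(10, 0.5, n),  5.0, 2.5),
    "Poisson(4)":      (rng.poisson(4, n),         4.0, 4.0),
    "Uniform(0,1)":    (rng.uniform(0, 1, n),      0.5, 1 / 12),
    "Normal(0,1)":     (rng.normal(0, 1, n),       0.0, 1.0),
    "Exponential(2)":  (rng.exponential(1 / 2, n), 0.5, 0.25),  # numpy takes scale = 1/λ
}

for name, (x, mean, var) in checks.items():
    print(f"{name:16s} mean {x.mean():.3f} (theory {mean}), "
          f"var {x.var():.3f} (theory {var:.3f})")
```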
12. Summary - Must-Know Concepts
Top 10 Concepts (Memorize These!)
1. Bayes' theorem: posterior ∝ likelihood × prior
2. Conditional probability and independence
3. Expectation, variance, covariance
4. Common distributions (Bernoulli, Binomial, Poisson, Uniform, Normal, Exponential)
5. Law of Large Numbers
6. Central Limit Theorem
7. Maximum likelihood (and MAP) estimation
8. Bias-variance tradeoff
9. Entropy, cross-entropy, KL divergence
10. Hypothesis testing and p-values
Key Connections to ML
- Cross-entropy loss = MLE for classification
- MSE loss = MLE under Gaussian noise
- Regularization = MAP with a prior (L2 ↔ Gaussian, L1 ↔ Laplace)
- Softmax output = categorical distribution over classes
Pro Tips
For ML/DL
Focus on Bayes, MLE, Cross-Entropy, and distributions - they're the foundation of modern ML.
For Interviews
Be able to explain CLT, bias-variance tradeoff, and why we use cross-entropy for classification.
Practice
Work through probability problems daily. Understand the intuition, not just formulas.