
Linear Algebra for Machine Learning

Complete reference covering vectors, matrices, eigenvalues, and SVD - everything you need for ML/DL.

1. Vectors

A vector is an ordered list of numbers representing a point in space or a direction.

v = [v₁, v₂, v₃, ..., vₙ]

Types of Vectors

Type          | Notation | Shape  | Example
Column Vector | v        | (n, 1) | [[1], [2], [3]]
Row Vector    | vᵀ       | (1, n) | [[1, 2, 3]]
Zero Vector   | 0        | (n, 1) | [[0], [0], [0]]
Unit Vector   | e        | (n, 1) | [[1], [0], [0]]

Vector Operations

Addition: u + v = [u₁+v₁, u₂+v₂, ..., uₙ+vₙ]
Scalar Multiplication: c·v = [c·v₁, c·v₂, ..., c·vₙ]
Dot Product: u·v = u₁v₁ + u₂v₂ + ... + uₙvₙ = Σ uᵢvᵢ

Example: Dot Product

u = [1, 2, 3]
v = [4, 5, 6]
u·v = 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = 32

Vector Magnitude (Length)

‖v‖ = √(v₁² + v₂² + ... + vₙ²) = √(v·v)
Example:
v = [3, 4]
‖v‖ = √(3² + 4²) = √(9 + 16) = √25 = 5

Unit Vector (Normalization)

v̂ = v / ‖v‖
Example:
v = [3, 4]
v̂ = [3/5, 4/5] = [0.6, 0.8]

Key Formulas

Angle between vectors: cos(θ) = (u·v) / (‖u‖ · ‖v‖)
Orthogonal vectors: u·v = 0 → u ⊥ v
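
A quick NumPy sketch of these vector operations (values reused from the examples above; names are illustrative):

import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

dot = u @ v                                       # 32
norm_v = np.linalg.norm(v)                        # √(16 + 25 + 36)
v_hat = v / norm_v                                # unit vector in the direction of v
cos_theta = dot / (np.linalg.norm(u) * norm_v)    # cos of the angle between u and v
is_orthogonal = np.isclose(dot, 0)                # True only if u ⊥ v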

2. Matrices

A 2D array of numbers arranged in rows and columns.

A = [a₁₁ a₁₂ a₁₃]
    [a₂₁ a₂₂ a₂₃]
    [a₃₁ a₃₂ a₃₃]
Shape: m × n (m rows, n columns)

Special Matrices

Matrix Type         | Definition        | Example
Square Matrix       | m = n             | 3×3, 4×4
Diagonal Matrix     | Aᵢⱼ = 0 if i ≠ j  | [[2,0,0], [0,3,0], [0,0,4]]
Identity Matrix (I) | Diagonal with 1's | [[1,0,0], [0,1,0], [0,0,1]]
Zero Matrix         | All zeros         | [[0,0], [0,0]]
Symmetric Matrix    | A = Aᵀ            | [[1,2,3], [2,4,5], [3,5,6]]
Upper Triangular    | Aᵢⱼ = 0 if i > j  | [[1,2,3], [0,4,5], [0,0,6]]
Lower Triangular    | Aᵢⱼ = 0 if i < j  | [[1,0,0], [2,3,0], [4,5,6]]

3. Matrix Operations

Transpose

Flip rows and columns. If A is m×n, then Aᵀ is n×m

Original Matrix A

A = [1 2 3]
    [4 5 6]

Transposed Aᵀ

Aᵀ = [1 4]
     [2 5]
     [3 6]

Properties

(Aᵀ)ᵀ = A
(A + B)ᵀ = Aᵀ + Bᵀ
(AB)ᵀ = BᵀAᵀ ← Order reversal!

Matrix Multiplication

Critical for Neural Networks!

C = A × B
Requirements:
• A is m×n
• B is n×p
• Result C is m×p
Rule: Cᵢⱼ = Σₖ AᵢₖBₖⱼ

Example

A = [1 2]    B = [5 6]
    [3 4]        [7 8]

A×B = [1×5+2×7  1×6+2×8] = [19 22]
      [3×5+4×7  3×6+4×8]   [43 50]

Properties

✓ Associative: (AB)C = A(BC)
✓ Distributive: A(B+C) = AB + AC
✗ NOT Commutative: AB ≠ BA
✓ Identity: AI = IA = A
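
A small NumPy check of the worked example and of these properties (illustrative values):

import numpy as np

A = np.array([[1, 2], [3, 4]])                # 2×2
B = np.array([[5, 6], [7, 8]])                # 2×2

C = A @ B                                     # [[19, 22], [43, 50]]
print(np.array_equal(A @ B, B @ A))           # False: not commutative
print(np.array_equal((A @ B).T, B.T @ A.T))   # True: (AB)ᵀ = BᵀAᵀ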

Element-wise (Hadamard) Product

Denoted by ⊙ or ∗

A ⊙ B = [a₁₁b₁₁ a₁₂b₁₂]
        [a₂₁b₂₁ a₂₂b₂₂]

Used in: dropout, attention mechanisms

Matrix-Vector Multiplication

[a₁₁ a₁₂]   [v₁]   [a₁₁v₁ + a₁₂v₂]
[a₂₁ a₂₂] × [v₂] = [a₂₁v₁ + a₂₂v₂]
This is the foundation of neural networks: y = Wx + b
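
As a minimal sketch, the same computation in NumPy (W, x, b are illustrative toy values, not a real layer):

import numpy as np

W = np.array([[1.0, 2.0], [3.0, 4.0]])   # weight matrix
x = np.array([0.5, -1.0])                # input vector
b = np.array([0.1, 0.2])                 # bias vector

y = W @ x + b                            # the linear part of one neural-network layer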

4. Special Matrices

Identity Matrix (I)

I = [1 0 0]
    [0 1 0]
    [0 0 1]
AI = IA = A
I is its own inverse

Inverse Matrix (A⁻¹)

A·A⁻¹ = A⁻¹·A = I

Example

A = [4 7]    det(A) = 4·6 - 7·2 = 10
    [2 6]

A⁻¹ = (1/10)·[ 6 -7] = [ 0.6 -0.7]
             [-2  4]   [-0.2  0.4]

Properties

(A⁻¹)⁻¹ = A
(AB)⁻¹ = B⁻¹A⁻¹ ← Order reversal!
(Aᵀ)⁻¹ = (A⁻¹)ᵀ

Inverse exists when:

• Matrix is square (n×n)
• Matrix is non-singular (det(A) ≠ 0)
• Matrix is full rank
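
A short NumPy check of the worked example above (np.linalg.inv assumes a square, non-singular matrix):

import numpy as np

A = np.array([[4.0, 7.0], [2.0, 6.0]])
A_inv = np.linalg.inv(A)                     # [[ 0.6, -0.7], [-0.2,  0.4]]
print(np.allclose(A @ A_inv, np.eye(2)))     # True: A·A⁻¹ = I
print(np.linalg.det(A))                      # 10.0, non-zero, so the inverse exists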

Orthogonal Matrix (Q)

QᵀQ = QQᵀ = I
Therefore: Qᵀ = Q⁻¹

Properties

• Preserves lengths: ‖Qv‖ = ‖v‖
• Preserves angles
• Rows/columns are orthonormal vectors

Example: Rotation Matrix

Q = [cos(θ) -sin(θ)]
    [sin(θ)  cos(θ)]
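
A minimal NumPy check of these properties for a rotation matrix (θ = π/4 chosen arbitrarily):

import numpy as np

theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(Q.T @ Q, np.eye(2)))             # QᵀQ = I
v = np.array([3.0, 4.0])
print(np.linalg.norm(Q @ v), np.linalg.norm(v))    # both 5.0: lengths are preserved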

Positive Definite Matrix

A symmetric matrix A is positive definite if xᵀAx > 0 for all x ≠ 0

Properties

• All eigenvalues > 0
• Used in optimization (convex functions)
• Covariance matrices are positive semi-definite

5. Matrix Properties

Determinant (det(A) or |A|)

For 2×2 matrix

A = [a b]
    [c d]
det(A) = ad - bc

For 3×3 matrix

A = [a b c]
    [d e f]
    [g h i]
det(A) = a(ei - fh) - b(di - fg) + c(dh - eg)

Properties

det(AB) = det(A)·det(B)
det(Aᵀ) = det(A)
det(A⁻¹) = 1/det(A)
det(cA) = cⁿdet(A) (for n×n matrix)
If det(A) = 0, matrix is singular (no inverse)

Geometric Meaning

• Determinant = volume scaling factor of linear transformation
• Sign indicates orientation (flip or not)

Trace (tr(A))

Sum of diagonal elements

tr(A) = a₁₁ + a₂₂ + ... + aₙₙ = Σ aᵢᵢ

Properties

tr(A + B) = tr(A) + tr(B)
tr(cA) = c·tr(A)
tr(Aᵀ) = tr(A)
tr(AB) = tr(BA) ← Cyclic property!
tr(A) = sum of eigenvalues

Rank

Maximum number of linearly independent rows (or columns)

rank(A) = r

Properties

rank(A) ≤ min(m, n) for m×n matrix
Full rank: rank(A) = min(m, n)
rank(AB) ≤ min(rank(A), rank(B))
rank(A) = rank(Aᵀ)

Interpretation

• Dimensionality of output space
• Number of independent features

6. Eigenvalues & Eigenvectors

Definition

For a square matrix A:

Av = λv
where:
• v is the eigenvector (direction that doesn't change)
• λ is the eigenvalue (scaling factor)

How to Find Eigenvalues

det(A - λI) = 0 ← Characteristic equation

Example

A = [4 1]
    [2 3]
det(A - λI) = (4-λ)(3-λ) - 2 = 0
λ² - 7λ + 10 = 0
(λ - 5)(λ - 2) = 0
Eigenvalues: λ₁ = 5, λ₂ = 2
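
The same example checked with NumPy (np.linalg.eig does not guarantee eigenvalue order):

import numpy as np

A = np.array([[4.0, 1.0], [2.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                               # [5., 2.] (order may differ)

# Verify Av = λv for the first eigenpair
v = eigvecs[:, 0]
print(np.allclose(A @ v, eigvals[0] * v))    # True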

Properties

Sum of eigenvalues = tr(A)
Product of eigenvalues = det(A)
For symmetric matrices: all eigenvalues are real
Eigenvectors of different eigenvalues are orthogonal (for symmetric A)

Diagonalization

If A has n linearly independent eigenvectors:

A = PDP⁻¹
where:
• P = matrix of eigenvectors
• D = diagonal matrix of eigenvalues

Why it matters

• Powers: Aⁿ = PDⁿP⁻¹ (easy to compute!)
• Used in PCA
• Understanding matrix behavior
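
A minimal sketch of the "powers" trick, assuming A is diagonalizable (same example matrix as above):

import numpy as np

A = np.array([[4.0, 1.0], [2.0, 3.0]])
eigvals, P = np.linalg.eig(A)
D = np.diag(eigvals)

A_pow5 = P @ D**5 @ np.linalg.inv(P)                        # Aⁿ = PDⁿP⁻¹
print(np.allclose(A_pow5, np.linalg.matrix_power(A, 5)))    # True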

7. Matrix Decompositions

LU Decomposition

A = LU
L = Lower triangular
U = Upper triangular

Used for: Solving linear systems efficiently

QR Decomposition

A = QR
Q = Orthogonal matrix
R = Upper triangular

Used for: Least squares, eigenvalue algorithms

Eigendecomposition

A = PDP⁻¹
P = Eigenvector matrix
D = Diagonal eigenvalue matrix

Requirements: A must be square and have n independent eigenvectors

Used for: PCA, understanding transformations

Singular Value Decomposition (SVD)

⚡ SUPER IMPORTANT FOR ML!

A = UΣVᵀ
where:
• U = m×m orthogonal matrix (left singular vectors)
• Σ = m×n diagonal matrix (singular values)
• V = n×n orthogonal matrix (right singular vectors)
Works for ANY matrix (not just square)!

Properties

Singular values σᵢ ≥ 0
Ordered: σ₁ ≥ σ₂ ≥ ... ≥ σᵣ > 0
rank(A) = number of non-zero singular values
‖A‖₂ = σ₁ (largest singular value)

Used in

PCA (Principal Component Analysis)
Image compression
Recommender systems
Low-rank approximations
Pseudo-inverse
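
A short NumPy sketch (note that np.linalg.svd returns Vᵀ directly and the singular values as a 1-D array; the matrix shape is arbitrary):

import numpy as np

A = np.random.randn(4, 3)                        # works for any shape, not just square
U, s, Vt = np.linalg.svd(A)                      # U: 4×4, s: (3,), Vt: 3×3

Sigma = np.zeros((4, 3))
Sigma[:3, :3] = np.diag(s)
print(np.allclose(A, U @ Sigma @ Vt))            # reconstruction A = UΣVᵀ
print(np.isclose(np.linalg.norm(A, 2), s[0]))    # spectral norm equals σ₁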

Cholesky Decomposition

For positive definite matrices:

A = LLᵀ
L = lower triangular

Used for: Solving linear systems, sampling from multivariate Gaussians
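
A minimal sketch of the Gaussian-sampling use case (the covariance values are illustrative and must be positive definite):

import numpy as np

mean = np.zeros(2)
cov = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite covariance

L = np.linalg.cholesky(cov)                # cov = LLᵀ
z = np.random.randn(1000, 2)               # standard normal samples
samples = mean + z @ L.T                   # samples with covariance ≈ cov
print(np.cov(samples, rowvar=False))       # close to cov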

8. Norms

Vector Norms

Measure of vector "size" or "length"

Norm | Formula                | Name      | Use Case
L0   | # of non-zero elements | L0-norm   | Sparsity
L1   | Σ|vᵢ|                  | Manhattan | Sparsity, robustness
L2   | √(Σvᵢ²)                | Euclidean | Most common
L∞   | max|vᵢ|                | Max norm  | Worst-case

Example

v = [3, -4]
L0 = 2 (two non-zero elements)
L1 = |3| + |-4| = 7
L2 = √(3² + 4²) = √25 = 5
L∞ = max(|3|, |-4|) = 4

Matrix Norms

Frobenius Norm

‖A‖_F = √(Σᵢⱼ aᵢⱼ²)
Example:
A = [1 2]
    [3 4]
‖A‖_F = √(1²+2²+3²+4²) = √30

Spectral Norm (L2)

‖A‖₂ = σ_max
(largest singular value)

Regularization in ML

L1 Regularization (Lasso)

Loss = MSE + λ·‖w‖₁

Encourages sparsity (many weights = 0)

L2 Regularization (Ridge)

Loss = MSE + λ·‖w‖₂²

Encourages small weights (weight decay)

9. Linear Transformations

A function T that satisfies:

T(au + bv) = aT(u) + bT(v)

Every linear transformation can be represented as a matrix!

Common Transformations

Scaling

S = [sₓ 0 ]
    [0  sᵧ]

Scales x by sₓ, y by sᵧ

Rotation (2D)

R(θ) = [cos(θ) -sin(θ)]
       [sin(θ)  cos(θ)]

Rotates counterclockwise

Shear

H = [1 k]
    [0 1]

Shears horizontally

Reflection

Rₓ = [1  0]    (reflection across the x-axis)
     [0 -1]

Rᵧ = [-1 0]    (reflection across the y-axis)
     [ 0 1]

Projection

Project vector v onto vector u:

proj_u(v) = (v·u / u·u)·u
Matrix form (for unit vector u):
P = uuᵀ
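
A small NumPy sketch of both forms (toy vectors):

import numpy as np

u = np.array([1.0, 0.0])
v = np.array([3.0, 4.0])

proj = (v @ u) / (u @ u) * u       # [3., 0.]
P = np.outer(u, u) / (u @ u)       # projection matrix (uuᵀ when u is a unit vector)
print(np.allclose(P @ v, proj))    # True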

10. Key Concepts for ML/DL

1. Linear Systems (Ax = b)

Ax = b

Solutions

• Unique solution: A is invertible → x = A⁻¹b
• No solution: b not in column space of A
• Infinite solutions: A is rank-deficient

2. Least Squares

Minimize ‖Ax - b‖₂²

Normal Equation:
x = (AᵀA)⁻¹Aᵀb

This is how linear regression works!
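
A minimal NumPy sketch comparing the normal equation with np.linalg.lstsq, which is the numerically preferred route (random toy data):

import numpy as np

A = np.random.randn(100, 3)                         # 100 samples, 3 features
b = np.random.randn(100)

x_normal = np.linalg.inv(A.T @ A) @ A.T @ b         # normal equation
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)     # stable least-squares solver
print(np.allclose(x_normal, x_lstsq))               # True when AᵀA is well-conditioned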

3. Moore-Penrose Pseudo-Inverse (A⁺)

For non-square or singular matrices:

A⁺ = (AᵀA)⁻¹Aᵀ (when A has full column rank)
Using SVD: A = UΣVᵀ
A⁺ = VΣ⁺Uᵀ
where Σ⁺ has reciprocals of non-zero singular values

Used in

• Linear regression
• Neural network weight initialization
• Solving underdetermined systems

4. Principal Component Analysis (PCA)

Steps

1. Center data: X_centered = X - mean(X)
2. Compute covariance: C = (1/n)X_centeredᵀX_centered
3. Eigen-decomposition: C = PDPᵀ (C is symmetric, so P is orthogonal and P⁻¹ = Pᵀ)
4. Sort eigenvalues: λ₁ ≥ λ₂ ≥ ... ≥ λₙ
5. Keep top k eigenvectors
6. Project: X_reduced = X_centered @ P[:, :k] (see the sketch below)

Why it works:

• Eigenvectors point in directions of maximum variance
• Eigenvalues tell you how much variance
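
A minimal NumPy sketch of these steps on toy data (np.linalg.eigh is used because the covariance matrix is symmetric; names are illustrative):

import numpy as np

X = np.random.randn(200, 5)                     # toy data: 200 samples, 5 features
X_centered = X - X.mean(axis=0)

C = (X_centered.T @ X_centered) / X.shape[0]    # covariance matrix
eigvals, P = np.linalg.eigh(C)                  # eigh: for symmetric matrices

order = np.argsort(eigvals)[::-1]               # sort by decreasing variance
P = P[:, order]

k = 2
X_reduced = X_centered @ P[:, :k]               # project onto the top-k components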

5. Covariance Matrix

Cov(X) = (1/n)XᵀX (for centered data)

Properties

• Symmetric: Cov = Covᵀ
• Positive semi-definite
• Diagonal = variances
• Off-diagonal = covariances

6. Gradient of Matrix Operations

Critical for backpropagation!

Function          | Gradient
f(x) = Ax         | ∇f = Aᵀ
f(x) = xᵀAx       | ∇f = (A + Aᵀ)x
f(x) = ‖Ax - b‖²  | ∇f = 2Aᵀ(Ax - b)
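
A quick finite-difference check of the last identity (random toy values):

import numpy as np

A = np.random.randn(4, 3)
b = np.random.randn(4)
x = np.random.randn(3)

f = lambda x: np.sum((A @ x - b) ** 2)                        # f(x) = ‖Ax - b‖²
grad_analytic = 2 * A.T @ (A @ x - b)                         # ∇f = 2Aᵀ(Ax - b)

eps = 1e-6
grad_numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(3)])                 # central differences
print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))    # True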

7. Matrix Calculus Identities

∂(Wx)/∂W = xᵀ
∂(Wx)/∂x = Wᵀ
∂(xᵀWx)/∂x = (W + Wᵀ)x
∂tr(AB)/∂A = Bᵀ

8. Batch Matrix Operations in Neural Networks

Input: X (batch_size × input_dim)
Weight: W (input_dim × output_dim)
Bias: b (output_dim,)
Forward: Y = XW + b

Gradients

∂L/∂W = Xᵀ @ ∂L/∂Y
∂L/∂X = ∂L/∂Y @ Wᵀ
∂L/∂b = sum(∂L/∂Y, axis=0)

9. Orthogonality in Neural Networks

Why orthogonal matrices are nice

• Preserve gradient magnitudes (no vanishing/exploding)
• Efficient to invert: Qᵀ = Q⁻¹
• Used in: RNNs, initialization

Orthogonal Initialization

# Q from the QR factorization of a random Gaussian matrix has orthonormal columns
W = np.linalg.qr(np.random.randn(n, m))[0]

10. Low-Rank Approximation

Using SVD for compression:

A = UΣVᵀ
A_k = U[:, :k] @ Σ[:k, :k] @ V[:, :k].T
This is the best rank-k approximation of A!

Used in

• Model compression
• Recommender systems (matrix factorization)
• Image compression
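
A short NumPy sketch (remember that np.linalg.svd returns Vᵀ, so the slice is Vt[:k, :]):

import numpy as np

A = np.random.randn(20, 10)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # best rank-k approximation
print(np.linalg.matrix_rank(A_k))               # 3
print(np.linalg.norm(A - A_k, 2), s[k])         # spectral error equals the first discarded singular value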

Quick Reference for Neural Networks

Forward Pass

# Single layer
Z = X @ W + b # Linear transformation
A = activation(Z) # Non-linearity
# Multi-layer
A1 = σ(X @ W1 + b1)
A2 = σ(A1 @ W2 + b2)
Y = A2 @ W3 + b3

Backward Pass (Chain Rule + Linear Algebra)

# Output layer
dL_dY = Y - Y_true
# Hidden layer
dL_dW3 = A2.T @ dL_dY
dL_db3 = sum(dL_dY, axis=0)
dL_dA2 = dL_dY @ W3.T
# Previous layer
dL_dZ2 = dL_dA2 * σ'(Z2)
dL_dW2 = A1.T @ dL_dZ2
...

Common Matrix Dimensions in DL

Fully Connected Layer

Input: (batch_size, input_dim)
Weight: (input_dim, output_dim)
Bias: (output_dim,)
Output: (batch_size, output_dim)

Convolutional Layer (simplified)

Input: (batch, channels_in, height, width)
Kernel: (channels_out, channels_in, k_h, k_w)
Output: (batch, channels_out, new_h, new_w)

Attention

Q: (batch, seq_len, d_model)
K: (batch, seq_len, d_model)
V: (batch, seq_len, d_model)
Attention: softmax(QKᵀ/√d_k) @ V
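
A minimal single-head sketch in NumPy, assuming d_k = d_model and omitting the learned projection matrices (shapes are illustrative):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

batch, seq_len, d_model = 2, 4, 8
Q = np.random.randn(batch, seq_len, d_model)
K = np.random.randn(batch, seq_len, d_model)
V = np.random.randn(batch, seq_len, d_model)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_model)   # (batch, seq_len, seq_len)
out = softmax(scores) @ V                               # (batch, seq_len, d_model)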

Most Important for Interviews

Top 10 Must-Know Concepts

1. Matrix multiplication (dimensions, order)
2. Transpose properties
3. Inverse (when it exists, properties)
4. Eigenvalues/Eigenvectors (intuition)
5. SVD (what it is, why it's useful)
6. Norms (L1, L2, Frobenius)
7. Orthogonal matrices
8. Rank (what it means)
9. Dot product / Inner product
10. Linear systems (Ax = b)

Practice Problems

1. Multiply: [1 2] × [5 6]
             [3 4]   [7 8]
2. Find the inverse of: [2 1]
                        [5 3]
3. Compute: ‖[3, 4, 12]‖₂
4. Find the eigenvalues of: [3 1]
                            [1 3]
5. What is the rank of: [1 2 3]
                        [2 4 6]

Answers

1. [19 22]
   [43 50]
2. [ 3 -1]
   [-5  2]
3. 13
4. λ₁ = 4, λ₂ = 2
5. rank = 1 (the rows are linearly dependent)

Python Implementation

import numpy as np

# Matrix operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
v = np.array([3, -4])          # example vector (illustrative values)
b = np.array([5, 6])           # example right-hand side (illustrative values)
# Multiplication
C = A @ B # or np.dot(A, B)
# Transpose
At = A.T
# Inverse
A_inv = np.linalg.inv(A)
# Determinant
det_A = np.linalg.det(A)
# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
# SVD
U, S, Vt = np.linalg.svd(A)
# Norms
l2_norm = np.linalg.norm(A) # Frobenius by default
l2_norm_vec = np.linalg.norm(v, 2)
l1_norm = np.linalg.norm(v, 1)
# Solve linear system Ax = b
x = np.linalg.solve(A, b)
# Pseudo-inverse
A_pinv = np.linalg.pinv(A)
# Rank
rank = np.linalg.matrix_rank(A)
# Trace
trace = np.trace(A)

Core Formulas - TL;DR

Matrix Multiplication: C_ij = Σ_k A_ik B_kj
Transpose: (AB)ᵀ = BᵀAᵀ
Inverse: AA⁻¹ = I
Determinant: |AB| = |A||B|
Eigenvalue: Av = λv
SVD: A = UΣVᵀ
Norm: ‖v‖₂ = √(Σvᵢ²)
Least Squares: x = (AᵀA)⁻¹Aᵀb
Gradient: ∂(Wx)/∂W = xᵀ

Summary

You now have everything you need for:

Machine Learning interviews
Deep Learning implementation
Understanding research papers
Debugging neural networks
Implementing algorithms from scratch

Focus on:

1. Matrix multiplication (shapes, order)
2. Transpose and inverse properties
3. Eigenvalues/SVD intuition
4. How linear algebra connects to neural networks