
Linear Algebra for Machine Learning

Complete reference covering vectors, matrices, eigenvalues, and SVD - everything you need for ML/DL.

1. Vectors

A vector is an ordered list of numbers representing a point in space or a direction.

v = [v₁, v₂, v₃, ..., vₙ]

Types of Vectors

Type          | Notation | Shape  | Example
Column Vector | v        | (n, 1) | [[1], [2], [3]]
Row Vector    | vᵀ       | (1, n) | [[1, 2, 3]]
Zero Vector   | 0        | (n, 1) | [[0], [0], [0]]
Unit Vector   | e        | (n, 1) | [[1], [0], [0]]

Vector Operations

Addition: u + v = [u₁+v₁, u₂+v₂, ..., uₙ+vₙ]
Scalar Multiplication: c·v = [c·v₁, c·v₂, ..., c·vₙ]
Dot Product: u·v = u₁v₁ + u₂v₂ + ... + uₙvₙ = Σ uᵢvᵢ

Example: Dot Product

u = [1, 2, 3]
v = [4, 5, 6]
u·v = 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = 32

Vector Magnitude (Length)

‖v‖ = √(v₁² + v₂² + ... + vₙ²) = √(v·v)
Example:
v = [3, 4]
‖v‖ = √(3² + 4²) = √(9 + 16) = √25 = 5

Unit Vector (Normalization)

v̂ = v / ‖v‖
Example:
v = [3, 4]
v̂ = [3/5, 4/5] = [0.6, 0.8]

Key Formulas

Angle between vectors: cos(θ) = (u·v) / (‖u‖ · ‖v‖)
Orthogonal vectors: u·v = 0 → u ⊥ v
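
A quick NumPy sketch of these vector operations (values reused from the examples above; names are illustrative):

import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

dot = u @ v                                       # 32
norm_v = np.linalg.norm(v)                        # √(16 + 25 + 36)
v_hat = v / norm_v                                # unit vector in the direction of v
cos_theta = dot / (np.linalg.norm(u) * norm_v)    # cos of the angle between u and v
is_orthogonal = np.isclose(dot, 0)                # True only if u ⊥ v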

2. Matrices

A 2D array of numbers arranged in rows and columns.

A = [a₁₁ a₁₂ a₁₃]
    [a₂₁ a₂₂ a₂₃]
    [a₃₁ a₃₂ a₃₃]
Shape: m × n (m rows, n columns)

Special Matrices

Matrix Type         | Definition        | Example
Square Matrix       | m = n             | 3×3, 4×4
Diagonal Matrix     | Aᵢⱼ = 0 if i ≠ j  | [[2,0,0], [0,3,0], [0,0,4]]
Identity Matrix (I) | Diagonal with 1's | [[1,0,0], [0,1,0], [0,0,1]]
Zero Matrix         | All zeros         | [[0,0], [0,0]]
Symmetric Matrix    | A = Aᵀ            | [[1,2,3], [2,4,5], [3,5,6]]
Upper Triangular    | Aᵢⱼ = 0 if i > j  | [[1,2,3], [0,4,5], [0,0,6]]
Lower Triangular    | Aᵢⱼ = 0 if i < j  | [[1,0,0], [2,3,0], [4,5,6]]

3. Matrix Operations

Transpose

Flip rows and columns. If A is m×n, then Aᵀ is n×m

Original Matrix A

A = [1 2 3]
    [4 5 6]

Transposed Aᵀ

Aᵀ = [1 4]
     [2 5]
     [3 6]

Properties

(Aᵀ)ᵀ = A
(A + B)ᵀ = Aᵀ + Bᵀ
(AB)ᵀ = BᵀAᵀ ← Order reversal!

Matrix Multiplication

Critical for Neural Networks!

C = A × B
Requirements:
• A is m×n
• B is n×p
• Result C is m×p
Rule: Cᵢⱼ = Σₖ AᵢₖBₖⱼ

Example

A = [1 2]    B = [5 6]
    [3 4]        [7 8]

A×B = [1×5+2×7  1×6+2×8] = [19 22]
      [3×5+4×7  3×6+4×8]   [43 50]

Properties

✓ Associative: (AB)C = A(BC)
✓ Distributive: A(B+C) = AB + AC
✗ NOT Commutative: AB ≠ BA
✓ Identity: AI = IA = A
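
A small NumPy check of the worked example and of these properties (illustrative values):

import numpy as np

A = np.array([[1, 2], [3, 4]])                # 2×2
B = np.array([[5, 6], [7, 8]])                # 2×2

C = A @ B                                     # [[19, 22], [43, 50]]
print(np.array_equal(A @ B, B @ A))           # False: not commutative
print(np.array_equal((A @ B).T, B.T @ A.T))   # True: (AB)ᵀ = BᵀAᵀ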

Element-wise (Hadamard) Product

Denoted by ⊙ or ∗

A ⊙ B = [a₁₁b₁₁ a₁₂b₁₂]
        [a₂₁b₂₁ a₂₂b₂₂]

Used in: dropout, attention mechanisms

Matrix-Vector Multiplication

[a₁₁ a₁₂]   [v₁]   [a₁₁v₁ + a₁₂v₂]
[a₂₁ a₂₂] × [v₂] = [a₂₁v₁ + a₂₂v₂]
This is the foundation of neural networks: y = Wx + b
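
As a minimal sketch, the same computation in NumPy (W, x, b are illustrative toy values, not a real layer):

import numpy as np

W = np.array([[1.0, 2.0], [3.0, 4.0]])   # weight matrix
x = np.array([0.5, -1.0])                # input vector
b = np.array([0.1, 0.2])                 # bias vector

y = W @ x + b                            # the linear part of one neural-network layer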

4. Special Matrices

Identity Matrix (I)

I = [1 0 0]
    [0 1 0]
    [0 0 1]
AI = IA = A
I is its own inverse

Inverse Matrix (A⁻¹)

A·A⁻¹ = A⁻¹·A = I

Example

A = [4 7]    det(A) = 4·6 - 7·2 = 10
    [2 6]

A⁻¹ = (1/10)·[ 6 -7] = [ 0.6 -0.7]
             [-2  4]   [-0.2  0.4]

Properties

(A⁻¹)⁻¹ = A
(AB)⁻¹ = B⁻¹A⁻¹ ← Order reversal!
(Aᵀ)⁻¹ = (A⁻¹)ᵀ

Inverse exists when:

• Matrix is square (n×n)
• Matrix is non-singular (det(A) ≠ 0)
• Matrix is full rank
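
A short NumPy check of the worked example above (np.linalg.inv assumes a square, non-singular matrix):

import numpy as np

A = np.array([[4.0, 7.0], [2.0, 6.0]])
A_inv = np.linalg.inv(A)                     # [[ 0.6, -0.7], [-0.2,  0.4]]
print(np.allclose(A @ A_inv, np.eye(2)))     # True: A·A⁻¹ = I
print(np.linalg.det(A))                      # 10.0, non-zero, so the inverse exists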

Orthogonal Matrix (Q)

QᵀQ = QQᵀ = I
Therefore: Qᵀ = Q⁻¹

Properties

• Preserves lengths: ‖Qv‖ = ‖v‖
• Preserves angles
• Rows/columns are orthonormal vectors

Example: Rotation Matrix

Q = [cos(θ) -sin(θ)]
    [sin(θ)  cos(θ)]
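
A minimal NumPy check of these properties for a rotation matrix (θ = π/4 chosen arbitrarily):

import numpy as np

theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(Q.T @ Q, np.eye(2)))             # QᵀQ = I
v = np.array([3.0, 4.0])
print(np.linalg.norm(Q @ v), np.linalg.norm(v))    # both 5.0: lengths are preserved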

Positive Definite Matrix

A symmetric matrix A is positive definite if xᵀAx > 0 for all x ≠ 0

Properties

• All eigenvalues > 0
• Used in optimization (convex functions)
• Covariance matrices are positive semi-definite

5. Matrix Properties

Determinant (det(A) or |A|)

For 2×2 matrix

A = [a b]
    [c d]
det(A) = ad - bc

For 3×3 matrix

A = [a b c]
    [d e f]
    [g h i]
det(A) = a(ei - fh) - b(di - fg) + c(dh - eg)

Properties

det(AB) = det(A)·det(B)
det(Aᵀ) = det(A)
det(A⁻¹) = 1/det(A)
det(cA) = cⁿdet(A) (for n×n matrix)
If det(A) = 0, matrix is singular (no inverse)

Geometric Meaning

• Determinant = volume scaling factor of linear transformation
• Sign indicates orientation (flip or not)

Trace (tr(A))

Sum of diagonal elements

tr(A) = a₁₁ + a₂₂ + ... + aₙₙ = Σ aᵢᵢ

Properties

tr(A + B) = tr(A) + tr(B)
tr(cA) = c·tr(A)
tr(Aᵀ) = tr(A)
tr(AB) = tr(BA) ← Cyclic property!
tr(A) = sum of eigenvalues

Rank

Maximum number of linearly independent rows (or columns)

rank(A) = r

Properties

rank(A) ≤ min(m, n) for m×n matrix
Full rank: rank(A) = min(m, n)
rank(AB) ≤ min(rank(A), rank(B))
rank(A) = rank(Aᵀ)

Interpretation

• Dimensionality of output space
• Number of independent features

6. Eigenvalues & Eigenvectors

Definition

For a square matrix A:

Av = λv
where:
• v is the eigenvector (direction that doesn't change)
• λ is the eigenvalue (scaling factor)

How to Find Eigenvalues

det(A - λI) = 0 ← Characteristic equation

Example

A = [4 1]
    [2 3]
det(A - λI) = (4-λ)(3-λ) - 2 = 0
λ² - 7λ + 10 = 0
(λ - 5)(λ - 2) = 0
Eigenvalues: λ₁ = 5, λ₂ = 2
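
The same example checked with NumPy (np.linalg.eig does not guarantee eigenvalue order):

import numpy as np

A = np.array([[4.0, 1.0], [2.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                               # [5., 2.] (order may differ)

# Verify Av = λv for the first eigenpair
v = eigvecs[:, 0]
print(np.allclose(A @ v, eigvals[0] * v))    # True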

Properties

Sum of eigenvalues = tr(A)
Product of eigenvalues = det(A)
For symmetric matrices: all eigenvalues are real
Eigenvectors of different eigenvalues are orthogonal (for symmetric A)

Diagonalization

If A has n linearly independent eigenvectors:

A = PDP⁻¹
where:
• P = matrix of eigenvectors
• D = diagonal matrix of eigenvalues

Why it matters

• Powers: Aⁿ = PDⁿP⁻¹ (easy to compute!)
• Used in PCA
• Understanding matrix behavior
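
A minimal sketch of the "powers" trick, assuming A is diagonalizable (same example matrix as above):

import numpy as np

A = np.array([[4.0, 1.0], [2.0, 3.0]])
eigvals, P = np.linalg.eig(A)
D = np.diag(eigvals)

A_pow5 = P @ D**5 @ np.linalg.inv(P)                        # Aⁿ = PDⁿP⁻¹
print(np.allclose(A_pow5, np.linalg.matrix_power(A, 5)))    # True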

7. Matrix Decompositions

LU Decomposition

A = LU
L = Lower triangular
U = Upper triangular

Used for: Solving linear systems efficiently

QR Decomposition

A = QR
Q = Orthogonal matrix
R = Upper triangular

Used for: Least squares, eigenvalue algorithms

Eigendecomposition

A = PDP⁻¹
P = Eigenvector matrix
D = Diagonal eigenvalue matrix

Requirements: A must be square and have n independent eigenvectors

Used for: PCA, understanding transformations

Singular Value Decomposition (SVD)

⚡ SUPER IMPORTANT FOR ML!

A = UΣVᵀ
where:
• U = m×m orthogonal matrix (left singular vectors)
• Σ = m×n diagonal matrix (singular values)
• V = n×n orthogonal matrix (right singular vectors)
Works for ANY matrix (not just square)!

Properties

Singular values σᵢ ≥ 0
Ordered: σ₁ ≥ σ₂ ≥ ... ≥ σᵣ > 0
rank(A) = number of non-zero singular values
‖A‖₂ = σ₁ (largest singular value)

Used in

PCA (Principal Component Analysis)
Image compression
Recommender systems
Low-rank approximations
Pseudo-inverse
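
A short NumPy sketch (note that np.linalg.svd returns Vᵀ directly and the singular values as a 1-D array; the matrix shape is arbitrary):

import numpy as np

A = np.random.randn(4, 3)                        # works for any shape, not just square
U, s, Vt = np.linalg.svd(A)                      # U: 4×4, s: (3,), Vt: 3×3

Sigma = np.zeros((4, 3))
Sigma[:3, :3] = np.diag(s)
print(np.allclose(A, U @ Sigma @ Vt))            # reconstruction A = UΣVᵀ
print(np.isclose(np.linalg.norm(A, 2), s[0]))    # spectral norm equals σ₁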

Cholesky Decomposition

For positive definite matrices:

A = LLᵀ
L = lower triangular

Used for: Solving linear systems, sampling from multivariate Gaussians
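
A minimal sketch of the Gaussian-sampling use case (the covariance values are illustrative and must be positive definite):

import numpy as np

mean = np.zeros(2)
cov = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite covariance

L = np.linalg.cholesky(cov)                # cov = LLᵀ
z = np.random.randn(1000, 2)               # standard normal samples
samples = mean + z @ L.T                   # samples with covariance ≈ cov
print(np.cov(samples, rowvar=False))       # close to cov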

8. Norms

Vector Norms

Measure of vector "size" or "length"

Norm | Formula                | Name      | Use Case
L0   | # of non-zero elements | L0-norm   | Sparsity
L1   | Σ|vᵢ|                  | Manhattan | Sparsity, robustness
L2   | √(Σvᵢ²)                | Euclidean | Most common
L∞   | max|vᵢ|                | Max norm  | Worst-case

Example

v = [3, -4]
L0 = 2 (two non-zero elements)
L1 = |3| + |-4| = 7
L2 = √(3² + 4²) = √25 = 5
L∞ = max(|3|, |-4|) = 4

Matrix Norms

Frobenius Norm

‖A‖_F = √(Σᵢⱼ aᵢⱼ²)
Example:
A = [1 2]
    [3 4]
‖A‖_F = √(1²+2²+3²+4²) = √30

Spectral Norm (L2)

‖A‖₂ = σ_max
(largest singular value)

Regularization in ML

L1 Regularization (Lasso)

Loss = MSE + λ·‖w‖₁

Encourages sparsity (many weights = 0)

L2 Regularization (Ridge)

Loss = MSE + λ·‖w‖₂²

Encourages small weights (weight decay)

9. Linear Transformations

A function T that satisfies:

T(au + bv) = aT(u) + bT(v)

Every linear transformation can be represented as a matrix!

Common Transformations

Scaling

S = [sₓ 0 ]
    [0  sᵧ]

Scales x by sₓ, y by sᵧ

Rotation (2D)

R(θ) = [cos(θ) -sin(θ)]
       [sin(θ)  cos(θ)]

Rotates counterclockwise

Shear

H = [1 k]
    [0 1]

Shears horizontally

Reflection

Rₓ = [1  0]    (reflection across the x-axis)
     [0 -1]

Rᵧ = [-1 0]    (reflection across the y-axis)
     [ 0 1]

Projection

Project vector v onto vector u:

proj_u(v) = (v·u / u·u)·u
Matrix form (for unit vector u):
P = uuᵀ
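
A small NumPy sketch of both forms (toy vectors):

import numpy as np

u = np.array([1.0, 0.0])
v = np.array([3.0, 4.0])

proj = (v @ u) / (u @ u) * u       # [3., 0.]
P = np.outer(u, u) / (u @ u)       # projection matrix (uuᵀ when u is a unit vector)
print(np.allclose(P @ v, proj))    # True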

10. Key Concepts for ML/DL

1. Linear Systems (Ax = b)

Ax = b

Solutions

• Unique solution: A is invertible → x = A⁻¹b
• No solution: b not in column space of A
• Infinite solutions: A is rank-deficient

2. Least Squares

Minimize ‖Ax - b‖₂²

Normal Equation:
x = (AᵀA)⁻¹Aᵀb

This is how linear regression works!
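
A minimal NumPy sketch comparing the normal equation with np.linalg.lstsq, which is the numerically preferred route (random toy data):

import numpy as np

A = np.random.randn(100, 3)                         # 100 samples, 3 features
b = np.random.randn(100)

x_normal = np.linalg.inv(A.T @ A) @ A.T @ b         # normal equation
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)     # stable least-squares solver
print(np.allclose(x_normal, x_lstsq))               # True when AᵀA is well-conditioned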

3. Moore-Penrose Pseudo-Inverse (A⁺)

For non-square or singular matrices:

A⁺ = (AᵀA)⁻¹Aᵀ (when A has full column rank)
Using SVD: A = UΣVᵀ
A⁺ = VΣ⁺Uᵀ
where Σ⁺ has reciprocals of non-zero singular values

Used in

• Linear regression
• Neural network weight initialization
• Solving underdetermined systems

4. Principal Component Analysis (PCA)

Steps

1. Center data: X_centered = X - mean(X)
2. Compute covariance: C = (1/n)X_centeredᵀX_centered
3. Eigen-decomposition: C = PDPᵀ (C is symmetric, so P is orthogonal and P⁻¹ = Pᵀ)
4. Sort eigenvalues: λ₁ ≥ λ₂ ≥ ... ≥ λₙ
5. Keep top k eigenvectors
6. Project: X_reduced = X_centered @ P[:, :k] (see the sketch below)

Why it works:

• Eigenvectors point in directions of maximum variance
• Eigenvalues tell you how much variance
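
A minimal NumPy sketch of these steps on toy data (np.linalg.eigh is used because the covariance matrix is symmetric; names are illustrative):

import numpy as np

X = np.random.randn(200, 5)                     # toy data: 200 samples, 5 features
X_centered = X - X.mean(axis=0)

C = (X_centered.T @ X_centered) / X.shape[0]    # covariance matrix
eigvals, P = np.linalg.eigh(C)                  # eigh: for symmetric matrices

order = np.argsort(eigvals)[::-1]               # sort by decreasing variance
P = P[:, order]

k = 2
X_reduced = X_centered @ P[:, :k]               # project onto the top-k components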

5. Covariance Matrix

Cov(X) = (1/n)XᵀX (for centered data)

Properties

• Symmetric: Cov = Covᵀ
• Positive semi-definite
• Diagonal = variances
• Off-diagonal = covariances

6. Gradient of Matrix Operations

Critical for backpropagation!

Function          | Gradient
f(x) = Ax         | ∇f = Aᵀ
f(x) = xᵀAx       | ∇f = (A + Aᵀ)x
f(x) = ‖Ax - b‖²  | ∇f = 2Aᵀ(Ax - b)
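
A quick finite-difference check of the last identity (random toy values):

import numpy as np

A = np.random.randn(4, 3)
b = np.random.randn(4)
x = np.random.randn(3)

f = lambda x: np.sum((A @ x - b) ** 2)                        # f(x) = ‖Ax - b‖²
grad_analytic = 2 * A.T @ (A @ x - b)                         # ∇f = 2Aᵀ(Ax - b)

eps = 1e-6
grad_numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(3)])                 # central differences
print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))    # True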

7. Matrix Calculus Identities

∂(Wx)/∂W = xᵀ
∂(Wx)/∂x = Wᵀ
∂(xᵀWx)/∂x = (W + Wᵀ)x
∂tr(AB)/∂A = Bᵀ

8. Batch Matrix Operations in Neural Networks

Input: X (batch_size × input_dim)
Weight: W (input_dim × output_dim)
Bias: b (output_dim,)
Forward: Y = XW + b

Gradients

∂L/∂W = Xᵀ @ ∂L/∂Y
∂L/∂X = ∂L/∂Y @ Wᵀ
∂L/∂b = sum(∂L/∂Y, axis=0)

9. Orthogonality in Neural Networks

Why orthogonal matrices are nice

• Preserve gradient magnitudes (no vanishing/exploding)
• Efficient to invert: Qᵀ = Q⁻¹
• Used in: RNNs, initialization

Orthogonal Initialization

# Q from the QR factorization of a random Gaussian matrix has orthonormal columns
W = np.linalg.qr(np.random.randn(n, m))[0]

10. Low-Rank Approximation

Using SVD for compression:

A = UΣVᵀ
A_k = U[:, :k] @ Σ[:k, :k] @ V[:, :k].T
This is the best rank-k approximation of A!

Used in

• Model compression
• Recommender systems (matrix factorization)
• Image compression
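
A short NumPy sketch (remember that np.linalg.svd returns Vᵀ, so the slice is Vt[:k, :]):

import numpy as np

A = np.random.randn(20, 10)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # best rank-k approximation
print(np.linalg.matrix_rank(A_k))               # 3
print(np.linalg.norm(A - A_k, 2), s[k])         # spectral error equals the first discarded singular value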

Quick Reference for Neural Networks

Forward Pass

# Single layer
Z = X @ W + b # Linear transformation
A = activation(Z) # Non-linearity
# Multi-layer
A1 = σ(X @ W1 + b1)
A2 = σ(A1 @ W2 + b2)
Y = A2 @ W3 + b3

Backward Pass (Chain Rule + Linear Algebra)

# Output layer
dL_dY = Y - Y_true
# Hidden layer
dL_dW3 = A2.T @ dL_dY
dL_db3 = sum(dL_dY, axis=0)
dL_dA2 = dL_dY @ W3.T
# Previous layer
dL_dZ2 = dL_dA2 * σ'(Z2)
dL_dW2 = A1.T @ dL_dZ2
...

Common Matrix Dimensions in DL

Fully Connected Layer

Input: (batch_size, input_dim)
Weight: (input_dim, output_dim)
Bias: (output_dim,)
Output: (batch_size, output_dim)

Convolutional Layer (simplified)

Input: (batch, channels_in, height, width)
Kernel: (channels_out, channels_in, k_h, k_w)
Output: (batch, channels_out, new_h, new_w)

Attention

Q: (batch, seq_len, d_model)
K: (batch, seq_len, d_model)
V: (batch, seq_len, d_model)
Attention: softmax(QKᵀ/√d_k) @ V
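
A minimal single-head sketch in NumPy, assuming d_k = d_model and omitting the learned projection matrices (shapes are illustrative):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

batch, seq_len, d_model = 2, 4, 8
Q = np.random.randn(batch, seq_len, d_model)
K = np.random.randn(batch, seq_len, d_model)
V = np.random.randn(batch, seq_len, d_model)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_model)   # (batch, seq_len, seq_len)
out = softmax(scores) @ V                               # (batch, seq_len, d_model)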

Most Important for Interviews

Top 10 Must-Know Concepts

1. Matrix multiplication (dimensions, order)
2. Transpose properties
3. Inverse (when it exists, properties)
4. Eigenvalues/Eigenvectors (intuition)
5. SVD (what it is, why it's useful)
6. Norms (L1, L2, Frobenius)
7. Orthogonal matrices
8. Rank (what it means)
9. Dot product / Inner product
10. Linear systems (Ax = b)

Practice Problems

1. Multiply: [1 2] × [5 6]
             [3 4]   [7 8]
2. Find the inverse of: [2 1]
                        [5 3]
3. Compute: ‖[3, 4, 12]‖₂
4. Find the eigenvalues of: [3 1]
                            [1 3]
5. What is the rank of: [1 2 3]
                        [2 4 6]

Answers

1. [19 22]
   [43 50]
2. [ 3 -1]
   [-5  2]
3. 13
4. λ₁ = 4, λ₂ = 2
5. rank = 1 (the rows are linearly dependent)

Python Implementation

import numpy as np

# Matrix operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
v = np.array([3, -4])          # example vector (illustrative values)
b = np.array([5, 6])           # example right-hand side (illustrative values)
# Multiplication
C = A @ B # or np.dot(A, B)
# Transpose
At = A.T
# Inverse
A_inv = np.linalg.inv(A)
# Determinant
det_A = np.linalg.det(A)
# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
# SVD
U, S, Vt = np.linalg.svd(A)
# Norms
l2_norm = np.linalg.norm(A) # Frobenius by default
l2_norm_vec = np.linalg.norm(v, 2)
l1_norm = np.linalg.norm(v, 1)
# Solve linear system Ax = b
x = np.linalg.solve(A, b)
# Pseudo-inverse
A_pinv = np.linalg.pinv(A)
# Rank
rank = np.linalg.matrix_rank(A)
# Trace
trace = np.trace(A)

Core Formulas - TL;DR

Matrix Multiplication: C_ij = Σ_k A_ik B_kj
Transpose: (AB)ᵀ = BᵀAᵀ
Inverse: AA⁻¹ = I
Determinant: |AB| = |A||B|
Eigenvalue: Av = λv
SVD: A = UΣVᵀ
Norm: ‖v‖₂ = √(Σvᵢ²)
Least Squares: x = (AᵀA)⁻¹Aᵀb
Gradient: ∂(Wx)/∂W = xᵀ

Summary

You now have everything you need for:

Machine Learning interviews
Deep Learning implementation
Understanding research papers
Debugging neural networks
Implementing algorithms from scratch

Focus on:

1. Matrix multiplication (shapes, order)
2. Transpose and inverse properties
3. Eigenvalues/SVD intuition
4. How linear algebra connects to neural networks