DBMS Part 5: Recovery, Query Optimization & Complete Reference

Core Topics

1. Failure Types
2. Log-Based Recovery
3. UNDO & REDO
4. Checkpointing
5. Query Optimization

Reference Materials

6. Complete Formula Sheet
7. Decision Trees
8. Common Mistakes
9. Problem Patterns
10. Final Checklist

Backup & Recovery

1. Failure Types

Failure Type	Cause	Data Loss	Recovery Method
Transaction	Logic error, abort	Volatile only	Rollback using log
System Crash	Power failure	RAM lost	Redo/Undo using log
Disk Failure	Hardware failure	Permanent loss	Restore from backup

2. Log-Based Recovery

Log Record Format

<T_i, start> → Transaction starts

<T_i, X, old_value, new_value> → Update record

<T_i, commit> → Transaction commits

<T_i, abort> → Transaction aborts

Example Log

<T1, start>
<T1, A, 1000, 950>    -- A changed from 1000 to 950
<T1, B, 2000, 2050>   -- B changed from 2000 to 2050
<T1, commit>

Write-Ahead Logging (WAL)

Rule 1: Log record must be written BEFORE database update
Rule 2: All log records of transaction must be written BEFORE commit
Ensures: Can undo incomplete transactions & redo committed transactions

3. UNDO & REDO Operations

UNDO

Purpose: Reverse effects of incomplete transaction

When: Transaction started but not committed before crash

Log: <T1, A, 1000, 950>
Recovery: A = 1000 (restore old value)

REDO

Purpose: Reapply effects of committed transaction

When: Transaction committed but changes not written to disk

Log: <T1, A, 1000, 950>
Recovery: A = 950 (reapply new value)

Recovery Algorithm

Step 1: Scan log backward
- • Find transactions that committed (have <T_i, commit>)
- • Find transactions that didn't commit (no <T_i, commit>)
Step 2: UNDO Phase (backward)
- • For uncommitted transactions
- • Restore old_values
Step 3: REDO Phase (forward)
- • For committed transactions
- • Reapply new_values

4. Checkpointing

Purpose: Reduce recovery time by marking a point where all updates are written to disk

Process:

1. Stop accepting new transactions
2. Complete all active transactions
3. Write all buffers to disk
4. Write <checkpoint> to log
5. Resume normal operation

Benefits

• Recovery starts from last checkpoint
• No need to scan entire log
• Only process transactions active at or after checkpoint

Query Optimization

5. Query Processing & Optimization

Cost Estimation

Primary Cost: Number of disk block accesses

Secondary: CPU time (usually ignored)

Selection Cost

Method	Cost	When to Use
Linear Search	b_r	No index, small table
Binary Search	⌈log₂(b_r)⌉	Sorted file
Primary Index	height + 1	Index on key
Secondary Index	height + matches	Index on non-key

Join Algorithms

Join Type	Cost	Best When
Nested-Loop	b_r + (n_r × b_s)	Small tables
Block Nested-Loop	b_r + (b_r × b_s)	Medium tables
Hash Join	≈ 3(b_r + b_s)	Large tables, equi-join
Sort-Merge	sort + (b_r + b_s)	Sorted data

Optimization Techniques

1. Selection Push-Down

BAD: σ_condition(R ⋈ S) | GOOD: σ_condition(R) ⋈ S

2. Projection Push-Down

Project early to reduce data size

3. Join Ordering

Start with join that produces smallest result

6. Complete Formula Reference Sheet

Disk & Storage

Total Disk Access Time

Time = Seek Time + Rotational Latency + Transfer Time

Transfer Time

Transfer Time = Data Size / Data Transfer Rate

Average Rotational Latency

Latency = 1 / (2 × RPM) × 60 seconds

MTBF

MTBF(array) = MTBF(one disk) / Number of disks

Indexing

B-Tree Order

key_size × (p-1) + pointer_size × p ≤ block_size

B-Tree Height

height = ⌈log_p(n)⌉

Max/Min Keys per Node

max = p-1, min = ⌈(p-1)/2⌉

Bitmap Index Size

size = (rows × distinct_values) / 8 bytes

Functional Dependencies

Number of Super Keys

count = 2^(n-k) where n=total attrs, k=candidate key size

Lossless Decomposition Test

R1 ∩ R2 → R1 OR R1 ∩ R2 → R2

Query Optimization Costs

Linear Search

Cost = b_r (blocks in relation)

Nested-Loop Join

Cost = b_r + (n_r × b_s)

Hash Join

Cost ≈ 3(b_r + b_s)

Transactions

Deadlock-Free Resources

min = Σ(max_requirement) - (n - 1)

Schedule Hierarchy

Strict ⊂ Cascadeless ⊂ Recoverable

7. Decision Trees

Which SQL Query to Use?

Finding specific records? → SELECT ... WHERE

Counting/Aggregating?
  ├─ Overall count → COUNT(*)
  └─ Per group → COUNT(*) + GROUP BY

Combining results?
  ├─ All from both → UNION
  ├─ Common only → INTERSECT
  └─ In A not B → EXCEPT

Matching pattern? → LIKE '%pattern%'

Data from multiple tables?
  ├─ Same column names → NATURAL JOIN or USING
  ├─ Different columns → JOIN ... ON
  └─ Same table → Self-join with aliases

"Has NOT done..." → EXCEPT or NOT EXISTS

"More than some" → > ANY (subquery)

"More than all" → > ALL (subquery)

Complex multi-step? → WITH (CTE)

Which Normal Form?

All values atomic? ──NO→ ❌ Not 1NF
        YES
        │
Single attribute key? ──YES→ Skip to 3NF check
        NO
        │
Partial dependency exists?
  ├─ YES → ❌ Not 2NF
  └─ NO → ✅ In 2NF
        │
Transitive dependency exists?
  ├─ YES → ❌ Not 3NF
  └─ NO → Check each FD α→β:
           - α not superkey AND
           - β not prime?
      ├─ YES → ❌ Not 3NF
      └─ NO → ✅ In 3NF
        │
Every FD has LHS = superkey?
  ├─ YES → ✅ In BCNF
  └─ NO → ❌ Not BCNF (but in 3NF)

Is Schedule Serializable?

Draw precedence graph
  (Edge Ti→Tj if conflicting ops where Ti first)
        │
        ├─ Cycle exists?
        │   ├─ YES → ❌ NOT conflict serializable
        │   └─ NO → ✅ Conflict serializable
        │
        └─ Check for blind writes
            ├─ Blind write exists → ❌ NOT view serializable
            └─ No blind writes → May be view serializable

8. Common Mistakes & Traps

❌ Using = NULL instead of IS NULL

WHERE phone IS NULL (correct)

❌ Confusing WHERE and HAVING

WHERE filters rows, HAVING filters groups

❌ Thinking 3NF = BCNF

3NF allows prime→prime, BCNF doesn't

❌ Wrong decomposition test

Test R1 ∩ R2 → R1 or R2 (not R1 ∪ R2)

❌ Sparse index on unordered file

Sparse ONLY works on ordered files

❌ Thinking 2PL prevents deadlock

2PL guarantees serializability, NOT deadlock-free

❌ Missing candidate keys

Check all attribute combinations systematically

❌ Wrong B-tree order formula

key_size × (p-1) + pointer × p ≤ block

❌ Confusing recoverable and cascadeless

All cascadeless → recoverable (not reverse)

❌ Always using index for queries

Use index only for <10% of rows

9. Assignment Question Patterns

Keywords in Question	Solution Approach
"Find names where..."	SELECT name FROM ... WHERE
"Count/Total number"	COUNT() + GROUP BY
"Has NOT done..."	EXCEPT or NOT EXISTS
"Maximum/Minimum in each"	Subquery with GROUP BY + MAX/MIN
"Pattern matching"	LIKE '%pattern%'
"Find order of B-tree"	key_size×(p-1) + ptr×p ≤ block
"Is schedule serializable?"	Draw precedence graph, check cycles
"Find candidate keys"	Attrs not on RHS, compute closure
"Highest normal form?"	Check partial, transitive, BCNF rules
"Lossless decomposition?"	R1 ∩ R2 → R1 or R2?