Lecture 20 - FS Reliability

Crash Consistency: Ensure file system is recoverable after crash

Consistent State: Failure atomicity. After a crash, either it looks like nothing happened, or the operation fully completed.

Approaches

UPS battery for a clean shutdown
fsck: Do nothing, try to repair afterwards
- Detect crash using a clean-unmount flag
- Scan entire fs for consistency rules
- Problem: Very slow (must scan the whole disk); restores metadata consistency only, cannot recover corrupted file data
Journaling: Treat fs operations as transactions (allow rollback / redo)
- Record writes in the journal before applying them in place (write data → write metadata to journal → commit journal)
- Know exactly what to do after a crash
- ext3: Store journal data as regular large file (for backward compatibility)
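
The journaling steps above can be sketched as a toy redo log. This is a minimal illustration, not how ext3 lays out its journal: the "disk" is a dict, the journal is a list, and every name (journal_write, recover, the block labels) is hypothetical.

```python
# Toy model: the disk is a dict of block name -> contents and the journal
# is a list of records. All names are made up for illustration.

def journal_write(journal, txid, updates, crash_before_commit=False):
    """Log the metadata updates of one transaction, then a commit record."""
    for block, value in updates.items():
        journal.append(("update", txid, block, value))
    if crash_before_commit:
        return  # simulated crash: the commit record never reaches the journal
    journal.append(("commit", txid))

def recover(disk, journal):
    """Redo committed transactions in log order; ignore uncommitted ones."""
    committed = {rec[1] for rec in journal if rec[0] == "commit"}
    for rec in journal:
        if rec[0] == "update" and rec[1] in committed:
            _, _, block, value = rec
            disk[block] = value
    return disk

disk = {"inode_7": "free", "bitmap_3": "0000"}
journal = []
# Transaction 1 commits; transaction 2 crashes before its commit record.
journal_write(journal, 1, {"inode_7": "file, size=4096", "bitmap_3": "0100"})
journal_write(journal, 2, {"inode_8": "dir"}, crash_before_commit=True)
recovered = recover(dict(disk), journal)
```

After recovery, transaction 1 is fully applied and transaction 2 leaves no trace, which is the failure atomicity described under Consistent State.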

Journaling Issues

Stale Metadata: File data overwritten by stale metadata during journal replay if both occupy the same block. (mkdir → rmdir → create a file whose data reuses the directory's freed block → crash)
- Solution: Write a revoke record to the journal so replay skips the freed block
Journal Corruption: Bit flip in journal data. Ext4 solution: checksums.
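
Both fixes can be sketched in one replay routine. This is a simplified model under stated assumptions, not the ext3/ext4 on-disk format: transactions are dicts, the checksum is a CRC32 over the record list (via zlib.crc32), and replay stops at the first transaction whose checksum fails. The names make_tx, tx_is_intact and replay are hypothetical.

```python
import zlib

def make_tx(txid, records):
    """Build a committed transaction whose checksum covers all its records."""
    return {"txid": txid, "records": records,
            "crc": zlib.crc32(repr(records).encode())}

def tx_is_intact(tx):
    return zlib.crc32(repr(tx["records"]).encode()) == tx["crc"]

def replay(disk, journal):
    # Pass 1: keep transactions up to the first corrupted one and collect
    # revokes as block -> id of the latest transaction that revoked it.
    intact, revoked = [], {}
    for tx in journal:
        if not tx_is_intact(tx):
            break
        intact.append(tx)
        for rec in tx["records"]:
            if rec[0] == "revoke":
                revoked[rec[1]] = tx["txid"]
    # Pass 2: redo updates, skipping blocks revoked by the same or a later tx.
    for tx in intact:
        for rec in tx["records"]:
            if rec[0] == "update":
                _, block, value = rec
                if revoked.get(block, -1) >= tx["txid"]:
                    continue
                disk[block] = value
    return disk

# mkdir logs block 50, rmdir revokes it, then file data reuses block 50.
disk = {50: "user file data"}
journal = [make_tx(1, [("update", 50, "old directory entries")]),
           make_tx(2, [("revoke", 50)])]
after_revoke = replay(dict(disk), journal)

# Simulate corruption in a later transaction: it must not be replayed.
tx3 = make_tx(3, [("update", 60, "inode table")])
tx3["records"][0] = ("update", 60, "inode tablf")
after_corruption = replay(dict(disk), journal + [tx3])
```

Without the revoke record, pass 2 would write the old directory entries over the user's file data in block 50.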

fsck Checks

Superblocks: Restore from a backup copy if corrupted
Free blocks: Scan inodes, build in-memory bitmap, compare with fs bitmap
Inode state: Check inode fields for corruption (if corrupted, remove inode)
Inode links: Verify link count by traversing directory tree (if orphaned, move to lost+found)
Duplicates: Check if two inodes refer to the same block (make copy of the block)
Bad blocks: Block pointers outside of the valid range (remove the pointer)
Directory checks: Make sure . and .. exist & directories are linked only once (prevent cycle)
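
A few of these checks (free blocks, duplicates, bad blocks, inode links) can be sketched over an in-memory model. This is only an illustration under assumptions: inodes are a dict of block-pointer lists, the bitmap is a list of booleans, directories are scanned flat rather than walked from the root, and "." and ".." entries are left out. The function names are hypothetical.

```python
from collections import Counter

def check_blocks(inodes, on_disk_bitmap, num_blocks):
    """inodes: dict of inode number -> list of block pointers."""
    owners, bad = {}, []
    for ino, blocks in inodes.items():
        for b in blocks:
            if not 0 <= b < num_blocks:
                bad.append((ino, b))  # pointer outside the valid range
                continue
            owners.setdefault(b, []).append(ino)
    # Rebuild the bitmap from the inodes and compare it with the on-disk one.
    rebuilt = [b in owners for b in range(num_blocks)]
    mismatches = [b for b in range(num_blocks)
                  if rebuilt[b] != on_disk_bitmap[b]]
    duplicates = {b: inos for b, inos in owners.items() if len(inos) > 1}
    return bad, mismatches, duplicates

def check_links(directories, link_counts):
    """directories: dir inode -> {name: inode}; link_counts: inode -> stored count."""
    seen = Counter()
    for entries in directories.values():
        for ino in entries.values():
            seen[ino] += 1
    wrong = {ino: (stored, seen[ino])
             for ino, stored in link_counts.items() if stored != seen[ino]}
    orphans = [ino for ino in link_counts if seen[ino] == 0]
    return wrong, orphans

inodes = {2: [10, 11], 3: [11, 99], 4: [12]}
bitmap = [b in (10, 11, 13) for b in range(16)]  # 12 unmarked, 13 leaked
bad, mismatches, duplicates = check_blocks(inodes, bitmap, 16)

directories = {1: {"a.txt": 2, "b.txt": 3, "a_hardlink": 2}}
wrong, orphans = check_links(directories, {2: 1, 3: 1, 4: 1})
```

Here block 99 is reported as a bad pointer, blocks 12 and 13 disagree with the on-disk bitmap, block 11 is claimed by two inodes, inode 2 has a stale link count, and inode 4 is an orphan that a real fsck would move to lost+found.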