Introduction
When a Spark Structured Streaming job fails mid-flight, how does it know where to resume? What prevents duplicate writes to your Delta tables? This article explores the elegant mechanisms that make Spark Structured Streaming fault-tolerant and exactly-once.
Key Insight:
The checkpoint directory and Delta Lake’s transaction log work together to ensure correctness even when clusters die between writing data and recording completion.
Checkpoint Directory Structure
When you start a streaming query, Spark creates a checkpoint directory with the following structure:
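A typical layout looks roughly like this (some entries only appear for certain sources or stateful operators):

```
<checkpoint-dir>/
├── metadata        # JSON file holding the query's unique id (the queryId)
├── offsets/        # one file per micro-batch: 0, 1, 2, ... (the planned input range)
├── commits/        # one file per micro-batch, written only after the batch succeeds
├── sources/        # per-source bookkeeping (e.g., initial Kafka offsets)
└── state/          # state store files for stateful operators, if any
```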
Critical Timing:
offsets/N is written before processing batch N starts.
commits/N is written after batch N completes successfully.
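As a concrete illustration, this is roughly what the two directories look like while batch 3 is still being processed:

```
offsets/   0  1  2  3     # offsets/3 written before batch 3 starts
commits/   0  1  2        # commits/3 appears only after batch 3 succeeds
```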
The Normal Happy Path
For every micro-batch N the sequence is the same: Spark determines the range of source data to read and records it in offsets/N, processes the batch, writes the results to the Delta table, and finally writes commits/N to mark the batch as done. Only then does it plan batch N+1.
The Critical Failure Scenario
Here’s where things get interesting. What happens when the cluster dies after writing to Delta but before writing the commit file?
The Problem
Batch N+1 was successfully written to Delta Lake, but the commit file was never created. On restart, Spark will see:
offsets/N+1 exists
commits/N+1 does not exist
The data for batch N+1 is already in the Delta table
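In checkpoint terms, the state after the crash looks like this:

```
offsets/   ...  N  N+1    # batch N+1 was planned
commits/   ...  N         # ...but never marked complete
# Delta table: batch N+1's rows are already there
```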
Question: Won’t re-running batch N+1 create duplicates? 🤔
Delta Lake’s Idempotency Magic
This is where Delta Lake’s transaction log saves the day. Delta records two critical pieces of metadata with every streaming write:
Note: The terms epochId and batchId refer to essentially the same thing: the monotonically increasing micro-batch number. I am still trying to pin down whether there is any subtle difference between the two.
txnAppId (Query ID): the unique streaming query identifier stored in the checkpoint's metadata/ file. This ensures different queries don't interfere with each other.
txnVersion (Epoch ID): the micro-batch number (0, 1, 2, 3, ...). It is monotonically increasing, and each batch gets its own ID.
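Concretely, each streaming micro-batch commit in the table's _delta_log carries a txn (SetTransaction) action recording exactly this pair, alongside the add actions for the new Parquet files. A simplified excerpt of one commit file (ids and timestamps are made up):

```
{"txn":{"appId":"abc-def-123-456","version":42,"lastUpdated":1700000000000}}
```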
The Solution
When Spark retries batch N+1, Delta Lake checks its transaction log:
Has transaction (queryId: "abc-def-123-456", epochId: N+1) already been committed?
If YES: skip the duplicate write, then create commits/N+1.
If NO: proceed with the write, then create commits/N+1.
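The built-in Delta streaming sink does this (queryId, epochId) bookkeeping for you. If you write with foreachBatch instead, recent Delta Lake releases let you opt into the same mechanism via the txnAppId and txnVersion writer options. A minimal sketch, with placeholder paths and a toy rate source:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is on the classpath

APP_ID = "demo-ingest-query"  # placeholder: any stable, unique id for this query

def write_batch(batch_df, batch_id):
    # Delta looks up (txnAppId, txnVersion) in its transaction log; if this
    # exact pair was already committed, the write is skipped as a duplicate.
    (batch_df.write
        .format("delta")
        .option("txnAppId", APP_ID)
        .option("txnVersion", batch_id)
        .mode("append")
        .save("/tables/events"))  # placeholder table path

query = (
    spark.readStream.format("rate").load()  # toy source for illustration
    .writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/checkpoints/events_ingest")  # placeholder
    .start()
)
```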
Complete Recovery Flow
On restart, Spark finds the latest offsets/N that has no matching commits/N, replans that batch from the recorded offsets, and attempts the write. Delta's transaction log then decides whether the data actually lands or the write is skipped as a duplicate, and only after that does Spark create commits/N and move on.
Key Insight: The checkpoint and Delta transaction log work together as a distributed two-phase commit. The checkpoint tracks intent, while Delta’s log tracks completion. Both must agree for the system to move forward.
Offset Semantics: [inclusive, exclusive)
When reading from Kafka, understanding offset boundaries is crucial for reasoning about what each batch consumes.
Start Offset: Inclusive. If start = 100, offset 100 is included in the batch.
End Offset: Exclusive. If end = 200, offset 200 is not included in the batch.
Next Batch: Batch N+1 would start at offset 200 (the previous end becomes the next start).
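For a Kafka source, the offsets/N file in the checkpoint records the intended end offsets per topic-partition for batch N. A simplified example (the exact layout varies slightly across Spark versions, and the topic name and offset values here are made up):

```
v1
{"batchWatermarkMs":0,"batchTimestampMs":1700000000000}
{"events":{"0":200,"1":183}}
```

Batch N then consumes from the previous batch's recorded offsets (inclusive) up to these offsets (exclusive), which is exactly the [inclusive, exclusive) rule above.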
TLDR: Recovery Rules
offsets/N exists and commits/N exists: batch N completed; move on to batch N+1.
offsets/N exists but commits/N is missing: replay batch N using the recorded offsets; Delta's (queryId, epochId) check makes the retry idempotent.
offsets/N does not exist: batch N was never planned; Spark plans it fresh from the last committed position.
Where to Find the IDs
Streaming Query ID
📁 Checkpoint metadata/ file
🖥️ Spark Streaming UI (while the query runs)
📓 Notebook outputs showing query status
Batch/Epoch IDs
📊 Tracked per micro-batch (0, 1, 2, ...)
🏷️ Used by Delta to prevent duplicate commits (same as epochId)
📜 Visible in the Delta transaction log
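If you want these programmatically, the StreamingQuery handle returned by start() exposes them. For example, continuing from the foreachBatch sketch above:

```python
print(query.id)               # stable query id; matches the checkpoint's metadata/ file
print(query.runId)            # changes on every (re)start of the query
progress = query.lastProgress # None until at least one batch has finished
if progress is not None:
    print(progress["batchId"])  # the current epoch / batch id
```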
When Things Go Wrong: A Real-World Accident
I encountered a scenario where duplicate records appeared in my Delta Lake table despite Structured Streaming's exactly-once guarantees. The quickest way to identify the scope of the problem was to use Delta's _metadata column to pinpoint which Parquet files contained duplicates. By tracing these files back through the Delta transaction log versions, I discovered the root cause: the same epochId appeared in multiple transactions with different queryId values. This broke Delta Lake's idempotency mechanism, which relies on the unique combination of (queryId, epochId) to detect and skip duplicate writes.
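A rough sketch of that first step (the table path and the event_id key are placeholders, and _metadata.file_path requires a reasonably recent Spark/Delta version):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.read.format("delta").load("/tables/events")  # placeholder path

# Business keys that appear more than once, together with the
# physical Parquet files that contain them.
dupes = (
    events
    .select("event_id", F.col("_metadata.file_path").alias("file_path"))
    .groupBy("event_id")
    .agg(F.count("*").alias("n_rows"),
         F.collect_set("file_path").alias("files"))
    .filter("n_rows > 1")
)
dupes.show(truncate=False)
```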
Root Cause
The checkpoint directory was accidentally overwritten or corrupted, causing Spark to reinitialize with a new queryId while replaying already-processed batches. Since Delta only saw new queryId values, it treated these as legitimate new transactions rather than duplicates, resulting in data duplication.
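One way to confirm this pattern directly from the transaction log (the table path is a placeholder; the txn action is part of Delta's commit protocol):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Each commit JSON in _delta_log may carry a `txn` action with appId/version.
log = spark.read.json("/tables/events/_delta_log/*.json")  # placeholder path

suspicious = (
    log.filter("txn IS NOT NULL")
    .select(F.col("txn.appId").alias("query_id"),
            F.col("txn.version").alias("epoch_id"))
    .groupBy("epoch_id")
    .agg(F.countDistinct("query_id").alias("n_query_ids"))
    .filter("n_query_ids > 1")   # the same epoch written under different query ids
)
suspicious.show()
```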
Key Lessons:
This incident brought to light a few critical practices:
Enable comprehensive logging: implement S3 Server Access Logging or AWS CloudTrail to audit access to checkpoint locations and detect unauthorized changes.
Implement strict access control on checkpoint directories (for example, via UC Volumes) to prevent accidental modifications or deletions.
Treat checkpoint directories as critical infrastructure requiring the same level of protection and operational discipline as your data itself
Critical Takeaway: While Spark and Delta Lake provide strong exactly-once semantics, the checkpoint directory is a critical piece of infrastructure that requires the same level of protection and monitoring as your data itself.
Summary: The Complete Picture
Checkpoint structure: offsets/ tracks what to process; commits/ tracks what has been completed
Timing matters: offsets are written before processing, commits are written after success
Delta Lake’s role: Transaction log with (query ID, epoch ID) prevents duplicates
Safe replay: If a batch is replayed, Delta checks for prior commit and skips if found
Exactly-once guarantee: Together, checkpoint + Delta transaction log ensure no data loss or duplication
Related Deep Dives:
If you found this troubleshooting walkthrough helpful, I have a couple of other related posts that dive deeper into Delta Lake forensics and data management:
Forensic Analysis: I’ve written a detailed guide on how to trace exactly which Delta version number, Parquet file, and commit produced specific records, including sample code for the investigation process. If there’s interest, I’m happy to write a dedicated breakdown on this methodology—just leave a comment below!
Handling Data Deletion: When you need to delete data from Delta Lake tables, understanding the impact on downstream streaming consumers is critical. I’ve covered this scenario in depth, including patterns for safe deletion and stream recovery. Check it out here: How to Actually Delete Data in Spark
📺 Deep Dive into Stateful Stream Processing in Structured Streaming
This talk covers the internals of stateful stream processing, checkpoint mechanisms, and recovery patterns in production environments. It provides invaluable insights into how Spark manages state and handles failures at scale.
Let me know in the comments if you’d like to see more operational war stories and troubleshooting techniques!