<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Canadian Data Guy Unfiltered: Deep Dive]]></title><description><![CDATA[These are long-form blog posts in which I aim to detail everything I know on the subject.]]></description><link>https://www.canadiandataguy.com/s/deep-dive</link><image><url>https://substackcdn.com/image/fetch/$s_!n3Eg!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30cc7753-f8fb-4300-ac7f-1806e112a06a_1024x1024.png</url><title>Canadian Data Guy Unfiltered: Deep Dive</title><link>https://www.canadiandataguy.com/s/deep-dive</link></image><generator>Substack</generator><lastBuildDate>Thu, 30 Apr 2026 04:10:04 GMT</lastBuildDate><atom:link href="https://www.canadiandataguy.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Canadian Data Guy]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[canadiandataguy@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[canadiandataguy@substack.com]]></itunes:email><itunes:name><![CDATA[Canadian Data Guy]]></itunes:name></itunes:owner><itunes:author><![CDATA[Canadian Data Guy]]></itunes:author><googleplay:owner><![CDATA[canadiandataguy@substack.com]]></googleplay:owner><googleplay:email><![CDATA[canadiandataguy@substack.com]]></googleplay:email><googleplay:author><![CDATA[Canadian Data Guy]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Stop Waiting for Connectors: Stream ANYTHING into Spark (It's 4 Functions)]]></title><description><![CDATA[Listen now | How to ingest data from any source into Apache Spark &#8212; demystified with real-world example of 
BlockChain Ingestion]]></description><link>https://www.canadiandataguy.com/p/stop-waiting-for-connectors-stream</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/stop-waiting-for-connectors-stream</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Mon, 03 Nov 2025 17:24:45 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/177861173/2ee733487ca1a5ca414a57c4cede2c92.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<h3>&#128161; What You&#8217;ll Learn</h3><p>By the end of this guide, you&#8217;ll understand that building a custom Spark streaming source isn&#8217;t rocket science. It&#8217;s actually a well-defined conversation between Spark and your code, with just <strong>5 key methods</strong> to implement. We&#8217;ll use a real Ethereum blockchain streaming example to show you exactly how it works.</p><h2>The Problem: You Have Data, Spark Wants It</h2><p>You&#8217;ve got data streaming in from somewhere unique &#8212; maybe it&#8217;s IoT sensors, a blockchain, a custom message queue, or an internal database. You want to process it with Spark&#8217;s powerful distributed engine, but there&#8217;s no pre-built connector. What do you do?</p><p>The good news: <strong>You can build your own custom source</strong>. The even better news: <strong>It&#8217;s simpler than you think</strong>.</p><div class="pullquote"><p><strong>Real-World Use Case:</strong> In this guide, we&#8217;ll walk through streaming Ethereum blockchain data into Spark. The same principles apply to any data source &#8212; from proprietary APIs to custom databases. 
The pattern is universal.</p></div><h2>The Secret: It&#8217;s Just a Conversation</h2><p>Think of building a custom Spark streaming source as a conversation between two specialists:</p><p><strong>The Two Characters in Our Story</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qNYK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qNYK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png 424w, https://substackcdn.com/image/fetch/$s_!qNYK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png 848w, https://substackcdn.com/image/fetch/$s_!qNYK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png 1272w, https://substackcdn.com/image/fetch/$s_!qNYK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qNYK!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png" width="1200" height="561.2167300380228" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:369,&quot;width&quot;:789,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:187528,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/177539932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qNYK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png 424w, https://substackcdn.com/image/fetch/$s_!qNYK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png 848w, https://substackcdn.com/image/fetch/$s_!qNYK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png 1272w, https://substackcdn.com/image/fetch/$s_!qNYK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Spark&#8217;s job</strong> (the Project Manager) is to handle all the complex distributed computing stuff: checkpointing, fault tolerance, distributing work across a cluster, and guaranteeing exactly-once processing semantics.</p><p><strong>Your code&#8217;s job</strong> (the Data Specialist) is much simpler: answer Spark&#8217;s questions about where your data is, how to access it, and how to break it into chunks that can be processed in parallel.</p><div class="pullquote"><p><strong>&#127919; Key Insight:</strong> You don&#8217;t need to understand distributed systems, fault tolerance algorithms, or checkpoint mechanisms. 
You just need to implement 5 simple methods that answer Spark&#8217;s questions about your data source.</p></div><h2>The 5 Questions Spark Will Ask You</h2><p>Spark&#8217;s conversation with your code follows a predictable pattern. It asks 5 questions, and you provide straightforward answers. Let&#8217;s look at each one:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LND2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LND2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png 424w, https://substackcdn.com/image/fetch/$s_!LND2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png 848w, https://substackcdn.com/image/fetch/$s_!LND2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png 1272w, https://substackcdn.com/image/fetch/$s_!LND2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LND2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png" width="828" height="662" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:662,&quot;width&quot;:828,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:130242,&quot;alt&quot;:&quot;1initialOffset() &#8212; &#8220;Where do we start?&#8221;Spark asks: &#8220;This is a brand new query. Where should I begin reading?&#8221;You answer: {&#8221;offset&#8221;: 1000} &#8212; &#8220;Start at block 1000&#8221;2latestOffset() &#8212; &#8220;What&#8217;s the newest data?&#8221;Spark asks: &#8220;What&#8217;s the most recent data available right now?&#8221;You answer: {&#8221;offset&#8221;: 1100} &#8212; &#8220;Latest block is 1100&#8221;3partitions() &#8212; &#8220;How do we split this work?&#8221;Spark asks: &#8220;We need to process blocks 1000-1100. Break this into parallel chunks&#8221;You answer: [Partition(1000-1025), Partition(1025-1050), ...] &#8212; &#8220;4 chunks of 25 blocks&#8221;4read() &#8212; &#8220;Fetch the data!&#8221;Spark tells each worker: &#8220;Here&#8217;s your chunk. Go get the actual data&#8221;You fetch: Loop through blocks 1000-1025, fetch each one, yield Row objects5commit() &#8212; &#8220;All done!&#8221; [optional]checkpoint/commit/{N} file created. Optional method for cleanup tasks.&quot;,&quot;title&quot;:&quot;1initialOffset() &#8212; &#8220;Where do we start?&#8221;Spark asks: &#8220;This is a brand new query. Where should I begin reading?&#8221;You answer: {&#8221;offset&#8221;: 1000} &#8212; &#8220;Start at block 1000&#8221;2latestOffset() &#8212; &#8220;What&#8217;s the newest data?&#8221;Spark asks: &#8220;What&#8217;s the most recent data available right now?&#8221;You answer: {&#8221;offset&#8221;: 1100} &#8212; &#8220;Latest block is 1100&#8221;3partitions() &#8212; &#8220;How do we split this work?&#8221;Spark asks: &#8220;We need to process blocks 1000-1100. 
Break this into parallel chunks&#8221;You answer: [Partition(1000-1025), Partition(1025-1050), ...] &#8212; &#8220;4 chunks of 25 blocks&#8221;4read() &#8212; &#8220;Fetch the data!&#8221;Spark tells each worker: &#8220;Here&#8217;s your chunk. Go get the actual data&#8221;You fetch: Loop through blocks 1000-1025, fetch each one, yield Row objects5commit() &#8212; &#8220;All done!&#8221; [optional]checkpoint/commit/{N} file created. Optional method for cleanup tasks.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/177539932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="1initialOffset() &#8212; &#8220;Where do we start?&#8221;Spark asks: &#8220;This is a brand new query. Where should I begin reading?&#8221;You answer: {&#8221;offset&#8221;: 1000} &#8212; &#8220;Start at block 1000&#8221;2latestOffset() &#8212; &#8220;What&#8217;s the newest data?&#8221;Spark asks: &#8220;What&#8217;s the most recent data available right now?&#8221;You answer: {&#8221;offset&#8221;: 1100} &#8212; &#8220;Latest block is 1100&#8221;3partitions() &#8212; &#8220;How do we split this work?&#8221;Spark asks: &#8220;We need to process blocks 1000-1100. Break this into parallel chunks&#8221;You answer: [Partition(1000-1025), Partition(1025-1050), ...] &#8212; &#8220;4 chunks of 25 blocks&#8221;4read() &#8212; &#8220;Fetch the data!&#8221;Spark tells each worker: &#8220;Here&#8217;s your chunk. Go get the actual data&#8221;You fetch: Loop through blocks 1000-1025, fetch each one, yield Row objects5commit() &#8212; &#8220;All done!&#8221; [optional]checkpoint/commit/{N} file created. Optional method for cleanup tasks." 
title="1initialOffset() &#8212; &#8220;Where do we start?&#8221;Spark asks: &#8220;This is a brand new query. Where should I begin reading?&#8221;You answer: {&#8221;offset&#8221;: 1000} &#8212; &#8220;Start at block 1000&#8221;2latestOffset() &#8212; &#8220;What&#8217;s the newest data?&#8221;Spark asks: &#8220;What&#8217;s the most recent data available right now?&#8221;You answer: {&#8221;offset&#8221;: 1100} &#8212; &#8220;Latest block is 1100&#8221;3partitions() &#8212; &#8220;How do we split this work?&#8221;Spark asks: &#8220;We need to process blocks 1000-1100. Break this into parallel chunks&#8221;You answer: [Partition(1000-1025), Partition(1025-1050), ...] &#8212; &#8220;4 chunks of 25 blocks&#8221;4read() &#8212; &#8220;Fetch the data!&#8221;Spark tells each worker: &#8220;Here&#8217;s your chunk. Go get the actual data&#8221;You fetch: Loop through blocks 1000-1025, fetch each one, yield Row objects5commit() &#8212; &#8220;All done!&#8221; [optional]checkpoint/commit/{N} file created. Optional method for cleanup tasks." 
srcset="https://substackcdn.com/image/fetch/$s_!LND2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png 424w, https://substackcdn.com/image/fetch/$s_!LND2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png 848w, https://substackcdn.com/image/fetch/$s_!LND2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png 1272w, https://substackcdn.com/image/fetch/$s_!LND2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Let&#8217;s See Real Code: Streaming Ethereum Blocks</h2><p>Theory is great, but let&#8217;s look at actual implementation. Here&#8217;s how these 5 methods work in practice for streaming Ethereum blockchain data:</p><h3><strong>1.</strong> initialOffset() &#8212; Setting the Starting Point</h3><pre><code><code>def initialOffset(self) -&gt; dict:
    """
    Called ONCE when starting a brand new query.
    Return where to begin reading.
    """
    start_block = self.options.get("start_block", 0)
    return {"offset": int(start_block)}
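What Spark does with this return value is worth seeing concretely. A minimal, Spark-free sketch (the `initial_offset` helper and the `start_block` option name mirror the snippet above; the JSON round-trip is how Spark persists offsets in its checkpoint directory):

```python
import json

# Spark serializes the dict returned by initialOffset() to JSON under
# checkpoint/offsets/ and hands the parsed dict back on restart.
def initial_offset(options: dict) -> dict:
    start_block = options.get("start_block", 0)
    return {"offset": int(start_block)}

offset = initial_offset({"start_block": "1000"})
persisted = json.dumps(offset)    # what lands in the checkpoint
restored = json.loads(persisted)  # what Spark hands back on restart
```

Because the offset is just JSON, anything serializable works as a position marker: block numbers, timestamps, log sequence numbers, or cursors.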
</code></code></pre><p>That&#8217;s it! Just return a dictionary with your starting position. Spark saves this and uses it as the baseline for the entire query lifecycle.</p><h3><strong>2.</strong> latestOffset() &#8212; Checking What&#8217;s Available</h3><pre><code><code>def latestOffset(self) -&gt; dict:
    """
    Called at the START of every batch.
    Connect to your source and return the newest available data.
    """
    latest_block = self.w3.eth.block_number
    return {"offset": int(latest_block)}
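Together, the checkpointed start offset and the value from latestOffset() define the microbatch. A small sketch of that arithmetic (the `batch_bounds` helper is illustrative, not part of the source API):

```python
# A microbatch covers [start, end): start comes from the checkpoint
# (initially from initialOffset()), end comes from latestOffset().
def batch_bounds(start: dict, end: dict) -> range:
    return range(start["offset"], end["offset"])  # end is exclusive

blocks = batch_bounds({"offset": 1000}, {"offset": 1100})
# covers blocks 1000..1099; block 1100 starts the next batch
```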
</code></code></pre><p>This method connects to your data source (in this case, an Ethereum node) and asks &#8220;what&#8217;s the latest?&#8221; The answer defines the upper bound for the current batch.</p><blockquote><p><strong>&#9888;&#65039; Python API Limitation:</strong> In PySpark, <code>latestOffset()</code> must return the absolute latest data point. If you&#8217;re backfilling from very old data, your first batch could be huge. The Scala API offers more fine-grained control here, but for most real-time use cases, the Python API works perfectly.<br><br><strong>&#128221; Note:</strong> This limitation is actively being addressed: a pull request to fix it in Spark is currently in progress.</p></blockquote><h3><strong>3.</strong> partitions() &#8212; Dividing the Work</h3><pre><code><code>def partitions(self, start: dict, end: dict) -&gt; list:
    """
    Spark gives you a range (start &#8594; end).
    You break it into smaller chunks for parallel processing.
    """
    start_block = start["offset"]
    end_block = end["offset"]  # This is EXCLUSIVE (not included)

    num_partitions = self.spark.conf.get("spark.sql.shuffle.partitions", "4")
    blocks_per_partition = (end_block - start_block) // int(num_partitions)

    partitions = []
    for i in range(int(num_partitions)):
        partition_start = start_block + (i * blocks_per_partition)
        partition_end = partition_start + blocks_per_partition
        if i == int(num_partitions) - 1:  # Last partition gets any remainder
            partition_end = end_block

        partitions.append(BlockRangePartition(partition_start, partition_end))

    return partitions
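The splitting logic above is easy to check in isolation. Here is a standalone version using plain tuples in place of `BlockRangePartition`, so it runs without Spark:

```python
# Same splitting logic as partitions(), minus Spark: divide [start, end)
# evenly, with the last chunk absorbing any remainder.
def split_range(start_block: int, end_block: int, num_partitions: int):
    per = (end_block - start_block) // num_partitions
    parts = []
    for i in range(num_partitions):
        p_start = start_block + i * per
        p_end = end_block if i == num_partitions - 1 else p_start + per
        parts.append((p_start, p_end))
    return parts

parts = split_range(1000, 1100, 4)
# [(1000, 1025), (1025, 1050), (1050, 1075), (1075, 1100)]
```

Note that each chunk's end equals the next chunk's start, so the half-open ranges tile the batch with no gaps and no overlaps.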
</code></code></pre><p><strong>How Partitioning Works</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nC1l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nC1l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png 424w, https://substackcdn.com/image/fetch/$s_!nC1l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png 848w, https://substackcdn.com/image/fetch/$s_!nC1l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png 1272w, https://substackcdn.com/image/fetch/$s_!nC1l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nC1l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png" width="778" height="344" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa0236f5-7080-4243-be23-b427bd18fd86_778x344.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:344,&quot;width&quot;:778,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43967,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/177539932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!nC1l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png 424w, https://substackcdn.com/image/fetch/$s_!nC1l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png 848w, https://substackcdn.com/image/fetch/$s_!nC1l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png 1272w, https://substackcdn.com/image/fetch/$s_!nC1l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><strong>&#128273; Critical Detail:</strong> Notice that the end block (1100) is <strong>exclusive</strong>. This means partition ranges are [1000, 1025), [1025, 1050), etc. Block 1100 is NOT processed&#8212;it becomes the start of the next batch. This [start, end) pattern is how Spark guarantees no data is ever processed twice.</p></blockquote><h3><strong>4.</strong> read() &#8212; Actually Fetching the Data</h3><pre><code><code>def read(self, partition: BlockRangePartition):
    """
    This runs on EXECUTOR nodes (distributed across the cluster).
    Each executor gets one partition and must fetch its assigned data.

    Must be DETERMINISTIC: same input = same output, every time.
    This allows Spark to safely retry failed tasks.
    """
    for block_number in range(partition.start_block, partition.end_block):
        # Connect to Ethereum and fetch this specific block
        block = self.w3.eth.get_block(block_number, full_transactions=True)

        # Convert to Spark Row format
        yield Row(
            block_number=block.number,
            block_hash=block.hash.hex(),
            timestamp=block.timestamp,
            transaction_count=len(block.transactions),
            # ... more fields ...
        )
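The determinism requirement is what makes retries safe. A toy demonstration, with `fake_get_block` standing in for `self.w3.eth.get_block` (purely illustrative names):

```python
# If a task dies mid-partition, Spark reruns read() for the SAME partition
# on another executor. Because the fetch is a pure function of the block
# number, the retry is indistinguishable from the first attempt.
def fake_get_block(n: int) -> dict:
    return {"number": n, "tx_count": n % 7}  # depends only on n

def read_partition(start_block: int, end_block: int):
    for n in range(start_block, end_block):
        yield fake_get_block(n)

first_attempt = list(read_partition(1000, 1005))
retry = list(read_partition(1000, 1005))  # identical rows, every time
```

This is also why `read()` should avoid non-deterministic inputs like wall-clock timestamps or "latest" queries; everything it needs should come from the partition's fixed [start, end) range.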
</code></code></pre><p>This is where the real work happens! Each executor in your cluster runs this method for its assigned partition, fetching the actual data.</p><div class="pullquote"><p><strong>&#128170; The Power of Parallelism:</strong> If you have 10 executors and create 100 partitions, all 10 executors work simultaneously. Each one processes its chunk, and as executors finish, Spark automatically assigns them new partitions. This is how Spark achieves massive throughput.</p></div><h3><strong>5.</strong> commit() &#8212; Cleanup (Usually Empty)</h3><pre><code><code>def commit(self, end: dict):
    """
    Called AFTER all partitions successfully complete.
    The checkpoint/commit/{N} file gets created at this point.
    This method is optional: mainly used for cleanup tasks.
    """
    pass  # Usually empty unless you need cleanup
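When you do need commit(), a typical use is pruning state you no longer need once Spark confirms the batch is durable. A hypothetical sketch (the local `cache` and this cleanup policy are invented for illustration, not part of the article's source):

```python
# Hypothetical cleanup: once Spark commits up to `end`, everything below
# that offset is durably processed, so a local block cache can drop it.
cache = {999: b"...", 1000: b"...", 1099: b"...", 1100: b"..."}

def commit(end: dict, cache: dict) -> None:
    committed = end["offset"]  # exclusive upper bound of the batch
    for block in [b for b in cache if b < committed]:
        del cache[block]

commit({"offset": 1100}, cache)
# only block 1100 (the next batch's start) remains cached
```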
</code></code></pre><p>In most cases, this method is empty. The checkpoint/commit/{N} file gets created automatically. You only need to implement this if you have cleanup tasks to perform after a batch completes.</p><h2>The Complete Flow: Visual Walkthrough</h2><p>Now let&#8217;s see how these methods work together in a complete streaming query:</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rHAT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faabc703d-d9e3-4c55-b900-a70ed91e0bb1_896x673.png"><img src="https://substackcdn.com/image/fetch/$s_!rHAT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faabc703d-d9e3-4c55-b900-a70ed91e0bb1_896x673.png" width="896" height="673" alt="" loading="lazy"></a></figure></div><h2>Why This Design Is Brilliant</h2><h4>&#128737;&#65039; Fault Tolerance</h4><p>If an executor fails while reading blocks 1025-1050, Spark simply restarts that task on another machine. Because <code>read()</code> is deterministic, it fetches exactly the same data again. The user never knows a failure occurred.</p><h4>&#9889; Exactly-Once Semantics</h4><p>The [start, end) exclusive range pattern means no block is ever processed twice. Block 1100 is the start of the next batch, not the end of the previous one. 
Combined with checkpointing, this guarantees exactly-once processing.</p><h4>&#128640; Massive Parallelism</h4><p>By implementing <code>partitions()</code>, you tell Spark how to break work into chunks. Spark handles distributing those chunks to hundreds or thousands of executors. You get massive scale &#8220;for free.&#8221;</p><h4>&#129513; Separation of Concerns</h4><p>You focus on <em>your data source&#8217;s logic</em>. Spark handles scheduling, distribution, checkpointing, fault recovery, and coordination. Clean boundaries make complex systems manageable.</p><h2>What About Edge Cases?</h2><h3>Handling Source Failures</h3><p>What if the Ethereum node goes down during <code>read()</code>?</p><pre><code><code>def read(self, partition: BlockRangePartition):
    # Assumes `import time` at the top of the module for the backoff below
    max_retries = 3
    for block_number in range(partition.start_block, partition.end_block):
        for attempt in range(max_retries):
            try:
                # Fetch inside the retry loop, yield outside it, so a retry
                # never re-emits a row that has already been produced
                block = self.w3.eth.get_block(block_number, full_transactions=True)
                break  # Success!
            except Exception:
                if attempt == max_retries - 1:
                    raise  # Let Spark handle the failure
                time.sleep(2 ** attempt)  # Exponential backoff
        yield Row(...)
</code></code></pre><p>If retries don&#8217;t work, the exception bubbles up, Spark marks the task as failed, and restarts it on another executor. Eventually the source recovers and processing continues from the checkpoint.</p><h3>Dealing with Large Batches</h3><p>What if <code>latestOffset()</code> returns a huge number?</p><blockquote><p><strong>The Golden Rule:</strong> Your processing rate should be greater than your input rate. Ideally, aim for <strong>10x faster processing than data arrival</strong>. This is the key design principle.<br><br>If you&#8217;re processing data faster than it&#8217;s arriving, Spark will naturally catch up with any backfill over the next few batches. You don&#8217;t need to worry about temporarily large batch sizes.<br><br><strong>About spark.sql.shuffle.partitions:</strong> You can adjust this, but don&#8217;t set it to an extremely high number. A reasonable partition count is sufficient as long as your processing rate exceeds your input rate.</p></blockquote><h3>Ensuring Determinism in read()</h3><p>The golden rule: <strong>Same partition input must produce same output</strong>.</p><p>Bad (non-deterministic):</p><pre><code><code># &#10060; DON&#8217;T DO THIS
def read(self, partition):
    current_time = time.time()  # Different each time!
    yield Row(timestamp=current_time, ...)
</code></code></pre><p>Good (deterministic):</p><pre><code><code># &#9989; DO THIS
def read(self, partition):
    block = self.w3.eth.get_block(partition.block_number)
    yield Row(timestamp=block.timestamp, ...)  # Block timestamp is consistent
</code></code></pre><h2>The Complete Picture: Architecture</h2><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!73zt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813ddac6-37ae-4fc8-892e-b7935028e236_826x541.png"><img src="https://substackcdn.com/image/fetch/$s_!73zt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813ddac6-37ae-4fc8-892e-b7935028e236_826x541.png" width="826" height="541" alt="" loading="lazy"></a></figure></div><h2>&#127919; You&#8217;re Ready to Build Your Own!</h2><blockquote><p>You now understand the complete lifecycle of a custom Spark streaming source. 
It&#8217;s not magic&#8212;it&#8217;s a well-designed conversation between Spark and your code.</p><p>Just implement 5 methods, and Spark handles the rest: fault tolerance, distribution, checkpointing, and exactly-once semantics.</p></blockquote><h2>Quick Reference: The 5 Methods</h2><p><strong>Your Implementation Checklist</strong></p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sR-g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea27c0b-cd3e-41f3-8c2b-e3bcb2c3f690_869x575.png"><img src="https://substackcdn.com/image/fetch/$s_!sR-g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea27c0b-cd3e-41f3-8c2b-e3bcb2c3f690_869x575.png" width="869" height="575" alt="" loading="lazy"></a></figure></div><h2>Final Thoughts: Why This Matters</h2><p>The beauty of this architecture is its universality. Whether you&#8217;re streaming from Ethereum, MongoDB, a proprietary API, or carrier pigeons &#128038;, the pattern is the same:</p><ol><li><p><strong>Define where to start</strong> (<code>initialOffset</code>)</p></li><li><p><strong>Check what&#8217;s new</strong> (<code>latestOffset</code>)</p></li><li><p><strong>Break work into chunks</strong> (<code>partitions</code>)</p></li><li><p><strong>Fetch the data</strong> (<code>read</code>)</p></li><li><p><strong>Confirm completion</strong> (<code>commit</code>)</p></li></ol><p>Spark handles everything else&#8212;checkpointing, distribution, scheduling, fault recovery. 
You just focus on the specifics of your data source.</p><h3>&#128640; Take Action</h3><div class="pullquote"><p>The barrier to entry is lower than you thought. Pick a data source you&#8217;re working with, implement these 5 methods, and you&#8217;ll have a production-ready Spark streaming source in an afternoon.</p><p><strong>Start small:</strong> Get <code>initialOffset()</code> and <code>latestOffset()</code> working first. Then add <code>partitions()</code> and <code>read()</code>. Test with a single partition before scaling up. You&#8217;ve got this! &#128170;</p></div><p><strong>Now go build something amazing with Spark Streaming. The data world is your oyster. &#127754;</strong></p><h2><a href="https://github.com/jiteshsoni/ethereum-streaming-pipeline/blob/6e06cdea573780ba09a33a334f7f07539721b85e/ethereum_block_stream_chainstack.py">Download the code</a></h2>]]></content:encoded></item><item><title><![CDATA[How to write your first Spark application with Stream-Stream Joins with working code]]></title><description><![CDATA[A Practical, Hands-On Guide to Joining Real-Time Data Streams in Spark Structured Streaming]]></description><link>https://www.canadiandataguy.com/p/how-to-write-your-first-spark-application-c23</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/how-to-write-your-first-spark-application-c23</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Wed, 15 Oct 2025 17:39:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lVsP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54172092-ba0b-49db-b7ae-033b9bef0640_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have you been waiting to try Streaming but cannot take the plunge?</p><p>In a single blog, we will teach you whatever needs to be understood about Streaming Joins. 
We will give you working code that you can use for your next Streaming Pipeline.</p><p>The steps involved:</p><ol><li><p>Create a fake dataset at scale</p></li><li><p>Set a baseline using traditional SQL</p></li><li><p>Define Temporary Streaming Views</p></li><li><p>Inner Joins with optional Watermarking</p></li><li><p>Left Joins with Watermarking</p></li><li><p>The cold start edge case: withEventTimeOrder</p></li><li><p>Cleanup</p></li></ol><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lVsP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54172092-ba0b-49db-b7ae-033b9bef0640_1024x1024.png"><img src="https://substackcdn.com/image/fetch/$s_!lVsP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54172092-ba0b-49db-b7ae-033b9bef0640_1024x1024.png" width="1024" height="1024" alt=""></a></figure></div><h2><strong>What is Stream-Stream Join?</strong></h2><p>Stream-stream join is a widely used operation in stream processing where two or more data streams are joined based on common attributes or keys. It is essential in several use cases, such as real-time analytics, fraud detection, and IoT data processing.</p><h3><strong>Concept of Stream-Stream Join</strong></h3><p>A stream-stream join combines two or more streams based on a common attribute or key, and it runs on an ongoing basis: each new data item arriving on either stream triggers a join operation. Every data item is treated as an event and is matched with the corresponding event from the other stream based on the matching criteria, typically a common attribute or key present in both streams.</p><p>Joining data streams comes with a key challenge: at any given moment, neither stream has a complete view of the dataset. This can make it difficult to find matches between inputs and generate accurate join results.</p><p>To overcome this challenge, it&#8217;s important to buffer past input as streaming state for both input streams. This allows every future input to be matched with past input, which can help to generate more accurate join results. 
Additionally, this buffering process can help to automatically handle late or out-of-order data, which is common in streaming environments.</p><p>To further optimize the join process, it&#8217;s also important to use watermarks to limit the state. Watermarks ensure that only the most relevant data is kept for generating join results, which improves accuracy and reduces processing times.</p><h3><strong>Types of Stream-Stream Join</strong></h3><p>Depending on the nature of the join and the matching criteria, there are several types of stream-stream join operations. Some of the popular types are:</p><p><strong>Inner Join</strong><br>In an inner join, only events with a match in both input streams are returned. This type of join is useful when combining the data from two streams with a common key or attribute.</p><p><strong>Outer Join</strong><br>In an outer join, all events from both input streams are included in the joined stream, whether or not there is a match between them. This type of join is useful when we need to combine data from two streams and there may be missing or incomplete data in either stream.</p><p><strong>Left Join</strong><br>In a left join, all events from the left input stream are included in the joined stream, and only the matching events from the right input stream are included. This type of join is useful when we need to keep all the data from the left stream, even if there is no matching data in the right stream.</p><h2><strong>1. The Setup: Create a fake dataset at scale</strong></h2><p>Most people do not have 2 streams just lying around to experiment with Stream-Stream Joins, so I used Faker to mock 2 different streams for this example.</p><p>We will use the Faker and faker_vehicle libraries to create the datasets.</p><pre><code>!pip install faker_vehicle
!pip install faker</code></pre><p>Imports</p><pre><code>from faker import Faker
from faker_vehicle import VehicleProvider
from pyspark.sql import functions as F
import uuid
from utils import logger</code></pre><p>Parameters</p><pre><code># define the schema name and where the tables should be stored
schema_name = "test_streaming_joins"
schema_storage_location = "/tmp/CHOOSE_A_PERMANENT_LOCATION/"</code></pre><p><strong>Create the Target Schema/Database</strong><br>Create a Schema and set its location. This way, all tables will inherit the base location.</p><pre><code>create_schema_sql = f"""
 CREATE SCHEMA IF NOT EXISTS {schema_name}
 COMMENT 'This is {schema_name} schema'
 LOCATION '{schema_storage_location}'
 WITH DBPROPERTIES (Owner='Jitesh');
 """
print(f"create_schema_sql: {create_schema_sql}")
spark.sql(create_schema_sql)</code></pre><p>Use Faker to define functions to help generate fake column values</p><pre><code>fake = Faker()
fake.add_provider(VehicleProvider)</code></pre><pre><code>event_id = F.udf(lambda: str(uuid.uuid4()))
vehicle_year_make_model = F.udf(fake.vehicle_year_make_model)
vehicle_year_make_model_cat = F.udf(fake.vehicle_year_make_model_cat)
vehicle_make_model = F.udf(fake.vehicle_make_model)
vehicle_make = F.udf(fake.vehicle_make)
vehicle_model = F.udf(fake.vehicle_model)
vehicle_year = F.udf(fake.vehicle_year)
vehicle_category = F.udf(fake.vehicle_category)
vehicle_object = F.udf(fake.vehicle_object)</code></pre><pre><code>latitude = F.udf(fake.latitude)
longitude = F.udf(fake.longitude)
location_on_land = F.udf(fake.location_on_land)
local_latlng = F.udf(fake.local_latlng)
zipcode = F.udf(fake.zipcode)</code></pre><p>Generate streaming source data at your desired rate</p><pre><code>def generated_vehicle_and_geo_df(rowsPerSecond: int, numPartitions: int):
    return (
        spark.readStream.format("rate")
        .option("numPartitions", numPartitions)
        .option("rowsPerSecond", rowsPerSecond)
        .load()
        .withColumn("event_id", event_id())
        .withColumn("vehicle_year_make_model", vehicle_year_make_model())
        .withColumn("vehicle_year_make_model_cat", vehicle_year_make_model_cat())
        .withColumn("vehicle_make_model", vehicle_make_model())
        .withColumn("vehicle_make", vehicle_make())
        .withColumn("vehicle_year", vehicle_year())
        .withColumn("vehicle_category", vehicle_category())
        .withColumn("vehicle_object", vehicle_object())
        .withColumn("latitude", latitude())
        .withColumn("longitude", longitude())
        .withColumn("location_on_land", location_on_land())
        .withColumn("local_latlng", local_latlng())
        .withColumn("zipcode", zipcode())
        )

# You can uncomment the below display command to check if the code in this cell works
#display(generated_vehicle_and_geo_df(rowsPerSecond=10, numPartitions=2))</code></pre><p>Now let&#8217;s generate the base source table, which we&#8217;ll call vehicle_geo.</p><pre><code>def stream_write_to_vehicle_geo_table(rowsPerSecond: int = 1000, numPartitions: int = 10):
    table_name_vehicle_geo = "vehicle_geo"
    (
        generated_vehicle_and_geo_df(rowsPerSecond, numPartitions)
            .writeStream
            .queryName(f"write_to_delta_table: {table_name_vehicle_geo}")
            .option("checkpointLocation", f"{schema_storage_location}/{table_name_vehicle_geo}/_checkpoint")
            .format("delta")
            .toTable(f"{schema_name}.{table_name_vehicle_geo}")
    )
stream_write_to_vehicle_geo_table(rowsPerSecond=1000, numPartitions=10)</code></pre><p>Let the above code run for a few iterations; you can play with rowsPerSecond and numPartitions to control how much data you would like to generate. Once you have generated enough data, kill the above stream and get a baseline row count.</p><pre><code>spark.read.table(f"{schema_name}.vehicle_geo").count()</code></pre><pre><code>display(
    spark.sql(f"""
    SELECT *
    FROM {schema_name}.vehicle_geo
""")
)</code></pre><p>Let&#8217;s also get the min &amp; max of the timestamp column, as we will be leveraging it for watermarking.</p><pre><code>display(
    spark.sql(f"""
    SELECT
         min(timestamp)
        ,max(timestamp)
        ,current_timestamp()
    FROM {schema_name}.vehicle_geo
""")
)</code></pre><h3><strong>Next, we will break this Delta table into 2 different tables</strong></h3><p>For Stream-Stream Joins we need 2 different streams, so we will use Delta-to-Delta streaming to create these tables.</p><ol><li><p><strong>a) Table: Vehicle</strong></p></li></ol><pre><code>vehicle_df = (
        spark.readStream.format("delta").option("maxFilesPerTrigger", "100").table(f"{schema_name}.vehicle_geo")
        .selectExpr(
            "event_id"
            ,"timestamp as vehicle_timestamp"
            ,"vehicle_year_make_model"
            ,"vehicle_year_make_model_cat"
            ,"vehicle_make_model"
            ,"vehicle_make"
            ,"vehicle_year"
            ,"vehicle_category"
            ,"vehicle_object"
            )
    )
#display(vehicle_df)
def stream_write_to_vehicle_table():
    table_name_vehicle = &#8220;vehicle&#8221;
    (   vehicle_df
        .writeStream
        #.trigger(availableNow=True)
        .queryName(f&#8221;write_to_delta_table: {table_name_vehicle}&#8221;)
        .option(&#8221;checkpointLocation&#8221;, f&#8221;{schema_storage_location}/{table_name_vehicle}/_checkpoint&#8221;)
        .format(&#8221;delta&#8221;)
        .toTable(f&#8221;{schema_name}.{table_name_vehicle}&#8221;)
    )

stream_write_to_vehicle_table()   </code></pre><ol><li><p><strong>b) Table: Geo</strong></p></li></ol><p>We have added a filter when we write to this table. This would be useful when we emulate the left join scenario. Filter: <code>where(&#8221;value like &#8216;1%&#8217; &#8220;)</code></p><pre><code>geo_df = (
    spark.readStream.format("delta").option("maxFilesPerTrigger", "100").table(f"{schema_name}.vehicle_geo")
        .selectExpr(
            "event_id"
            ,"value"
            ,"timestamp as geo_timestamp"
            ,"latitude"
            ,"longitude"
            ,"location_on_land"
            ,"local_latlng"
            ,"cast(zipcode as integer) as zipcode"
        ).where("value like '1%' ")
    )
#geo_df.printSchema()
#display(geo_df)

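# Why Geo will have fewer rows than Vehicle: the rate source's `value` column
# is a monotonically increasing integer, and the filter value LIKE '1%' only
# keeps values whose decimal form starts with "1". A quick pure-Python sanity
# check (illustration only, not part of the pipeline):
matching = sum(1 for v in range(10_000) if str(v).startswith("1"))
# 1, 10-19, 100-199 and 1000-1999 match among the first 10,000 values
assert matching == 1111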
def stream_write_to_geo_table():
    table_name_geo = "geo"
    (   geo_df
        .writeStream
        #.trigger(availableNow=True)
        .queryName(f"write_to_delta_table: {table_name_geo}")
        .option("checkpointLocation", f"{schema_storage_location}/{table_name_geo}/_checkpoint")
        .format("delta")
        .toTable(f"{schema_name}.{table_name_geo}")
    )

stream_write_to_geo_table()</code></pre><h2><strong>2. Set a baseline using traditional SQL</strong></h2><p>Before we do the actual streaming joins, let&#8217;s do a regular join and figure out the expected row count.</p><p><strong>Get row count from Inner Join</strong></p><pre><code>sql_query_batch_inner_join = f'''
        SELECT count(vehicle.event_id) as row_count_for_inner_join
        FROM {schema_name}.vehicle vehicle
        JOIN {schema_name}.geo geo
        ON vehicle.event_id = geo.event_id
        AND vehicle_timestamp &gt;= geo_timestamp - INTERVAL 5 MINUTES
        '''
print(f''' Run SQL Query: 
          {sql_query_batch_inner_join}       
       ''')
display( spark.sql(sql_query_batch_inner_join) )</code></pre><p><strong>Get row count from Left Join</strong></p><pre><code>sql_query_batch_left_join = f'''
        SELECT count(vehicle.event_id) as row_count_for_left_join
        FROM {schema_name}.vehicle vehicle
        LEFT JOIN {schema_name}.geo geo
        ON vehicle.event_id = geo.event_id
            -- Assume there is business logic that the timestamps cannot be more than 5 minutes apart
        AND vehicle_timestamp &gt;= geo_timestamp - INTERVAL 5 MINUTES
        '''
print(f''' Run SQL Query: 
          {sql_query_batch_left_join}       
       ''')
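# A pure-Python illustration (not Spark) of why the left-join count is always
# at least the inner-join count: the left join keeps every vehicle row, padding
# unmatched ones with NULLs, while the inner join drops them.
vehicle_ids = ["e1", "e2", "e3"]
geo_ids = {"e1", "e3"}  # geo is filtered, so some event_ids are missing
inner = [v for v in vehicle_ids if v in geo_ids]
left = [(v, v if v in geo_ids else None) for v in vehicle_ids]
assert len(inner) == 2 and len(left) == 3  # left keeps unmatched "e2" as NULL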
display( spark.sql(sql_query_batch_left_join) )</code></pre><h2><strong>Summary so far:</strong></h2><ol><li><p>We created a source Delta table: vehicle_geo</p></li><li><p>We split its columns into two tables: Vehicle and Geo</p></li><li><p>The Vehicle row count matches vehicle_geo, and it has a subset of its columns</p></li><li><p>The Geo row count is lower than Vehicle&#8217;s because we added a filter when writing to the Geo table</p></li><li><p>We ran two SQL queries to identify what the row counts should be after the stream-stream joins</p></li></ol><h2><strong>3. Define Temporary Streaming Views</strong></h2><p>Some people prefer to write the logic in SQL, so we create streaming views that can be manipulated with SQL. The code block below creates a view and sets a watermark on the stream.</p><pre><code>def stream_from_delta_and_create_view(schema_name: str, table_name: str, column_to_watermark_on: str, how_late_can_the_data_be: str = "2 minutes", maxFilesPerTrigger: int = 100):
    view_name = f"_streaming_vw_{schema_name}_{table_name}"
    print(f"Table {schema_name}.{table_name} is now streaming under a temporary view called {view_name}")
    (
        spark.readStream.format("delta")
        .option("maxFilesPerTrigger", f"{maxFilesPerTrigger}")
        .option("withEventTimeOrder", "true")
        .table(f"{schema_name}.{table_name}")
        .withWatermark(column_to_watermark_on, how_late_can_the_data_be)
        .createOrReplaceTempView(view_name)
    )
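# Watermark intuition (plain Python, not a Spark API): with a delay of
# how_late_can_the_data_be = "1 minutes", events older than
# max_event_time_seen - 1 minute are considered too late and are dropped
# by stateful operators. Illustration with hypothetical timestamps:
from datetime import datetime, timedelta

max_event_time_seen = datetime(2023, 1, 1, 12, 10, 0)
watermark = max_event_time_seen - timedelta(minutes=1)  # 12:09:00
late_event_time = datetime(2023, 1, 1, 12, 8, 30)
assert late_event_time < watermark  # this event would be treated as late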
</code></pre><p><strong>3. a Create Vehicle Stream</strong></p><p>Let&#8217;s create a Vehicle stream and set its watermark to 1 minute.</p><pre><code>stream_from_delta_and_create_view(schema_name=schema_name, table_name="vehicle", column_to_watermark_on="vehicle_timestamp", how_late_can_the_data_be="1 minutes")</code></pre><p>Let&#8217;s visualize the stream.</p><pre><code>display(
    spark.sql(f'''
        SELECT *
        FROM _streaming_vw_test_streaming_joins_vehicle
    ''')
)</code></pre><p>You can also run an aggregation on the stream. It&#8217;s out of scope for this blog, but here is how you can do it:</p><pre><code>display(
    spark.sql(f'''
        SELECT 
            vehicle_make
            ,count(1) as row_count
        FROM _streaming_vw_test_streaming_joins_vehicle
        GROUP BY vehicle_make
        ORDER BY vehicle_make
    ''')
)</code></pre><p><strong>3. b Create Geo Stream</strong></p><p>Let&#8217;s create a Geo stream and set its watermark to 2 minutes.</p><pre><code>stream_from_delta_and_create_view(schema_name=schema_name, table_name="geo", column_to_watermark_on="geo_timestamp", how_late_can_the_data_be="2 minutes")</code></pre><p>Have a look at what the data looks like:</p><pre><code>display(
    spark.sql(f'''
        SELECT *
        FROM _streaming_vw_test_streaming_joins_geo
    ''')
)</code></pre><h2><strong>4. Inner Joins with optional Watermarking</strong></h2><p>Inner joins on any kind of columns and with any kind of conditions are possible in streaming, but be aware of the potential for unbounded state growth: as new input arrives, it can potentially match with any input from the past, so the streaming state can grow rapidly.</p><p>To avoid this, define additional join conditions that prevent indefinitely old inputs from matching with future inputs. This lets the engine clear old inputs from the state, preventing unbounded state growth and ensuring more efficient processing.</p><p>There are several ways to define these additional conditions. For example, you might limit the scope of the join by only matching on a subset of columns, or you might set a time-based constraint that prevents old inputs from being considered after a certain period of time has elapsed.</p><p>Ultimately, the key to managing streaming state size is to consider the requirements of your specific use case and tune your join conditions accordingly. <strong>Although watermarking is optional for inner joins, I would highly recommend you set a watermark on both streams.</strong></p><pre><code>sql_for_stream_stream_inner_join = f"""
    SELECT 
        vehicle.*
        ,geo.latitude
        ,geo.longitude
        ,geo.zipcode
    FROM _streaming_vw_test_streaming_joins_vehicle vehicle
    JOIN _streaming_vw_test_streaming_joins_geo geo
    ON vehicle.event_id = geo.event_id
    -- Assume there is business logic that the timestamps cannot be more than X minutes apart
    AND vehicle_timestamp BETWEEN geo_timestamp - INTERVAL 5 MINUTES AND geo_timestamp
"""
#display(spark.sql(sql_for_stream_stream_inner_join))</code></pre><pre><code>table_name_stream_stream_innner_join = 'stream_stream_innner_join'

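# Semantics check of the time-range condition in the join above (pure Python,
# illustration only): a vehicle row matches a geo row with the same event_id
# only when geo_timestamp - 5 min <= vehicle_timestamp <= geo_timestamp.
from datetime import datetime, timedelta

def in_join_window(vehicle_ts, geo_ts, window=timedelta(minutes=5)):
    return geo_ts - window <= vehicle_ts <= geo_ts

t0 = datetime(2023, 1, 1, 12, 0, 0)
assert in_join_window(t0, t0)                             # same instant
assert in_join_window(t0, t0 + timedelta(minutes=5))      # vehicle 5 min earlier
assert not in_join_window(t0, t0 + timedelta(minutes=6))  # outside the window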
(   spark.sql(sql_for_stream_stream_inner_join)
    .writeStream
    #.trigger(availableNow=True)
        .queryName(f"write_to_delta_table: {table_name_stream_stream_innner_join}")
        .option("checkpointLocation", f"{schema_storage_location}/{table_name_stream_stream_innner_join}/_checkpoint")
        .format("delta")
        .toTable(f"{schema_name}.{table_name_stream_stream_innner_join}")
)</code></pre><p>Once the stream has caught up, the row count in the next step should match the regular batch SQL job.</p><pre><code>spark.read.table(f"{schema_name}.{table_name_stream_stream_innner_join}").count()</code></pre><h3><strong>How was the watermark computed in this scenario?</strong></h3><p>When we defined the streaming views for Vehicle and Geo, we set their watermarks to 1 min and 2 min, respectively.</p><p>Look at the join condition we mentioned:</p><pre><code>AND vehicle_timestamp &gt;= geo_timestamp - INTERVAL 5 minutes</code></pre><p>5 min (time range) + 2 min (Geo&#8217;s watermark delay) = 7 min.</p><p>Spark Structured Streaming automatically calculates this 7 min bound, and state older than that is cleared.</p><h2><strong>5. Left Joins with Watermarking</strong></h2><p>While the watermark + event-time constraints are optional for inner joins, for outer joins they must be specified. To generate the NULL results of an outer join, the engine must know when an input row is not going to match with anything in the future. Hence, the watermark + event-time constraints must be specified for generating correct results.</p><h3><strong>5.a How Left Joins work differently than Inner Joins</strong></h3><p>One important factor is that the outer NULL results will be generated with a delay that depends on the specified watermark delay and the time range condition. This delay is necessary to ensure that there were no matches, and that there will be no matches in the future.</p><p>In the current implementation of the micro-batch engine, watermarks are advanced at the end of each micro-batch, and the next micro-batch uses the updated watermark to clean up the state and output outer results. However, this means that the generation of outer results may be delayed if there is no new data being received in the stream. 
If either of the two input streams being joined does not receive data for a while, the outer output (in both left and right cases) may be delayed.</p><pre><code>sql_for_stream_stream_left_join = f"""
    SELECT 
        vehicle.*
        ,geo.latitude
        ,geo.longitude
        ,geo.zipcode
    FROM _streaming_vw_test_streaming_joins_vehicle vehicle
    LEFT JOIN _streaming_vw_test_streaming_joins_geo geo
    ON vehicle.event_id = geo.event_id
        AND vehicle_timestamp BETWEEN geo_timestamp - INTERVAL 5 MINUTES AND geo_timestamp
"""
#display(spark.sql(sql_for_stream_stream_left_join))

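# How long Spark must keep rows in state for this left join (a
# back-of-the-envelope check, not a Spark API): roughly the other stream's
# watermark delay plus the width of the join time range.
from datetime import timedelta

geo_watermark_delay = timedelta(minutes=2)  # set on the geo streaming view
join_time_range = timedelta(minutes=5)      # INTERVAL 5 MINUTES in the join
state_retention = geo_watermark_delay + join_time_range
assert state_retention == timedelta(minutes=7)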
table_name_stream_stream_left_join = 'stream_stream_left_join'

(   spark.sql(sql_for_stream_stream_left_join)
    .writeStream
    #.trigger(availableNow=True)
        .queryName(f"write_to_delta_table: {table_name_stream_stream_left_join}")
        .option("checkpointLocation", f"{schema_storage_location}/{table_name_stream_stream_left_join}/_checkpoint")
        .format("delta")
        .toTable(f"{schema_name}.{table_name_stream_stream_left_join}")
)</code></pre><p>Once the stream has caught up, get the row count in the next step and compare it with the batch left-join baseline.</p><pre><code>spark.read.table(f"{schema_name}.{table_name_stream_stream_left_join}").count()</code></pre><blockquote><p><em><strong>You will find that some records that could not match have not been released yet, which is expected. </strong>The outer NULL results are generated with a delay that depends on the specified watermark delay and the time range condition, because the engine has to wait that long to ensure there were no matches and there will be no more matches in the future.</em></p><p><em><strong>The watermark will advance once new data is pushed to the stream.</strong></em></p></blockquote><p>So let&#8217;s generate some more fake data for the base table: <strong>vehicle_geo. </strong>This time we send a much lower volume of 10 records per second. Let the command below run for at least one batch, then kill it.</p><pre><code>stream_write_to_vehicle_geo_table(rowsPerSecond=10, numPartitions=10)</code></pre><h3><strong>5. b What to observe:</strong></h3><ol><li><p>Soon you should see the watermark move ahead and the number of records in &#8216;Aggregation State&#8217; go down.</p></li><li><p>If you click on the running stream, open the raw data tab, and look for &#8220;watermark&#8221;, you will see it has advanced.</p></li><li><p>Once 0 records per second are being processed, your stream has caught up, and the row count should now match the traditional SQL left join.</p></li></ol><pre><code>spark.read.table(f"{schema_name}.{table_name_stream_stream_left_join}").count()</code></pre><h2><strong>6. The cold start edge case: withEventTimeOrder</strong></h2><blockquote><p><em>&#8220;When using a Delta table as a stream source, the query first processes all of the data present in the table. 
The Delta table at this version is called the initial snapshot. By default, the Delta table&#8217;s data files are processed based on which file was last modified. However, the last modification time does not necessarily represent the record event time order.</em></p><p><em>In a stateful streaming query with a defined watermark, processing files by modification time can result in records being processed in the wrong order. This could lead to records being dropped as late events by the watermark.</em></p><p><em>You can avoid the data drop issue by enabling the following option:</em></p><p><em>withEventTimeOrder: Whether the initial snapshot should be processed with event time order.</em></p></blockquote><p><strong>If you use startingVersion, the withEventTimeOrder option is ignored.</strong></p><p>In our scenario, I applied this in Step 3 when we created the temporary streaming views.</p><pre><code>spark.readStream.format("delta")
        .option("maxFilesPerTrigger", f"{maxFilesPerTrigger}")
        .option("withEventTimeOrder", "true")
        .table(f"{schema_name}.{table_name}")</code></pre><h2><strong>7. Cleanup</strong></h2><p>Drop all tables in the schema and delete all the checkpoints.</p><pre><code>spark.sql(
    f"""
    DROP SCHEMA IF EXISTS {schema_name} CASCADE
"""
)


dbutils.fs.rm(schema_storage_location, True)</code></pre><p>If you have reached so far, you now have a working pipeline and a solid example which you can use going forward.</p><h2><strong>Download the code</strong></h2><p><a href="https://github.com/jiteshsoni/material_for_public_consumption/blob/main/notebooks/spark_stream_stream_join.py">https://github.com/jiteshsoni/material_for_public_consumption/blob/main/notebooks/spark_stream_stream_join.py</a></p><h3><strong>References:</strong></h3><ol><li><p><a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins">https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins</a></p></li><li></li></ol><div id="youtube2-hyZU_bw1-ow" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;hyZU_bw1-ow&quot;,&quot;startTime&quot;:&quot;1181&quot;,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/hyZU_bw1-ow?start=1181&amp;rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><ol><li></li></ol><div id="youtube2-1cBDGsSbwRA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;1cBDGsSbwRA&quot;,&quot;startTime&quot;:&quot;1500s&quot;,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/1cBDGsSbwRA?start=1500s&amp;rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><ol><li><p><a 
href="https://www.databricks.com/blog/2022/08/22/feature-deep-dive-watermarking-apache-spark-structured-streaming.html">https://www.databricks.com/blog/2022/08/22/feature-deep-dive-watermarking-apache-spark-structured-streaming.html</a></p></li><li><p><a href="https://docs.databricks.com/structured-streaming/delta-lake.html#process-initial-snapshot-without-data-being-dropped">https://docs.databricks.com/structured-streaming/delta-lake.html#process-initial-snapshot-without-data-being-dropped</a></p></li></ol><h2><strong>Footnote:</strong></h2><p>Thank you for taking the time to read this article. If you found it helpful or enjoyable, please consider clapping to show appreciation and help others discover it. Don&#8217;t forget to follow me for more insightful content, and visit my website <strong><a href="https://canadiandataguy.com/">CanadianDataGuy.com</a></strong> for additional resources and information. Your support and feedback are essential to me, and I appreciate your engagement with my work.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.canadiandataguy.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading CanadianDataGuy&#8217;s No Fluff Newsletter! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[A Deep Dive into Skewed Joins, GroupBy Bottlenecks, and Smart Strategies to Keep Your Spark Jobs Flying]]></title><description><![CDATA[Unlock comprehensive, practical solutions to conquer data skew in Apache Spark&#8212;step-by-step from basics to advanced strategies for perfectly balanced workloads and optimized job performance.]]></description><link>https://www.canadiandataguy.com/p/a-deep-dive-into-skewed-joins-groupby</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/a-deep-dive-into-skewed-joins-groupby</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Fri, 06 Jun 2025 03:11:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data skew in Apache Spark refers to an <strong>uneven distribution of data across partitions</strong>, often manifesting during shuffle-intensive operations like joins or group-by aggregations. In a skewed scenario, one or a few partitions end up holding far more records for a particular key than others, leading to <strong>hotspots</strong> and <strong>straggler tasks</strong>. This imbalance causes <strong>performance bottlenecks</strong> (tasks processing heavy partitions take much longer) and <strong>inefficient resource usage</strong> (some executors sit idle). 
In extreme cases, heavily skewed partitions can even exhaust executor memory and cause job failures. Below, we delve into why skew occurs in joins and aggregations, and provide comprehensive strategies&#8212;ranging from Spark configuration tweaks to code-level patterns and architectural designs&#8212;to alleviate data skew. </p><h2>Why Data Skew Occurs in Joins and Aggregations</h2><p><strong>Join Operations:</strong> In Spark (excluding broadcast joins), joining two datasets on a key requires redistributing data so that records with the same key end up on the same partition (for a shuffle hash join or sort-merge join). If the key distribution is highly uneven (e.g. one key value appears in 90% of the records), the partition handling that key will be <strong>massive compared to others</strong>, causing skew. All records for that popular key funnel into one task, creating a severe load imbalance. For example, consider joining a large transactions table with a user table on <code>user_id</code> when a few &#8220;power users&#8221; have the vast majority of transactions. The join partition corresponding to those user_ids will handle hundreds of thousands of records, while other partitions process only a few &#8211; resulting in stragglers and possibly out-of-memory errors.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.canadiandataguy.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading CanadianDataGuy&#8217;s No Fluff Newsletter! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6y5D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6y5D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png 424w, https://substackcdn.com/image/fetch/$s_!6y5D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png 848w, https://substackcdn.com/image/fetch/$s_!6y5D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png 1272w, https://substackcdn.com/image/fetch/$s_!6y5D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6y5D!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png" 
width="1200" height="1210.7142857142858" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7aa8254b-9360-4d6d-8772-87863babbf7b_3806x3840.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1469,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:1090821,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/165303261?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa8254b-9360-4d6d-8772-87863babbf7b_3806x3840.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6y5D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png 424w, https://substackcdn.com/image/fetch/$s_!6y5D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png 848w, https://substackcdn.com/image/fetch/$s_!6y5D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png 1272w, https://substackcdn.com/image/fetch/$s_!6y5D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png 1456w" sizes="100vw" fetchpriority="high"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p><strong>GroupBy and Aggregations:</strong> Similarly, grouping or aggregating by a key brings all data for each key onto one executor. If some keys occur far more frequently than others, those keys&#8217; partitions become disproportionately large. 
For instance, a <code>groupBy("customer_id")</code> on an orders dataset where a handful of customers account for most orders will produce skew: the reducer for those popular customers must aggregate an extremely large list, while others handle trivial amounts<a href="https://www.linkedin.com/pulse/what-data-skewness-spark-how-handle-code-soutir-sen-xf6hf#:~:text=Skewness%20often%20arises%20during%20operations,For%20example">l</a>. Even though Spark performs map-side partial aggregation, a single reduce task will still have to combine all intermediate results for a heavy key, leading to one very slow task.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-TIJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-TIJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png 424w, https://substackcdn.com/image/fetch/$s_!-TIJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png 848w, https://substackcdn.com/image/fetch/$s_!-TIJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png 1272w, https://substackcdn.com/image/fetch/$s_!-TIJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!-TIJ!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png" width="1200" height="1246.1538461538462" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f70466ce-9b00-4cdd-9e70-f55d0cd1f468_3699x3840.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1512,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:1239456,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/165303261?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff70466ce-9b00-4cdd-9e70-f55d0cd1f468_3699x3840.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-TIJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png 424w, https://substackcdn.com/image/fetch/$s_!-TIJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png 848w, https://substackcdn.com/image/fetch/$s_!-TIJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png 1272w, 
https://substackcdn.com/image/fetch/$s_!-TIJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>Understanding these root causes guides us to solutions. 
Next, we address <strong>join skew</strong> and <strong>groupBy/aggregation skew</strong> separately, discussing targeted techniques for each.</p><h2>How do we know if we have a Skew Problem?</h2><p>Several indicators can help you confirm a skew problem in Spark:</p><ol><li><p><strong>Task Duration Discrepancy</strong>:</p><ul><li><p>If all tasks in a shuffle stage finish except for a few that hang for a long time, this may indicate data skew.</p></li></ul></li><li><p><strong>Spark UI Analysis</strong>:</p><ul><li><p>Check the task summary metrics in the Spark UI. A large gap between the minimum and maximum shuffle read sizes suggests skew.</p></li></ul></li><li><p><strong>Data Spills</strong>:</p><ul><li><p>If, despite tuning the number of shuffle partitions, there are numerous data spills, this might point to data skew.</p></li></ul></li><li><p><strong>Row Count Disparity</strong>:</p><ul><li><p>Counting rows grouped by join or aggregation columns can reveal skew. A significant difference in row counts for different groups indicates potential skew issues.</p></li></ul></li><li><p><strong>Compression Ratios</strong>:</p><ul><li><p>Highly compressed tables can affect the estimation of shuffle partitions, leading to spills. Monitoring this can help identify such cases.</p></li></ul></li></ol><p><a href="https://www.databricks.com/discover/pages/optimize-data-workloads-guide">Additionally, Spark SQL's Adaptive Query Execution (AQE) can help detect and sometimes resolve data skew dynamically by adjusting execution strategies as needed. </a></p><h2>Mitigating Skew in Join Operations</h2><p>When joining two datasets on a key, Spark must shuffle records so that identical keys end up on the same partition. If one key is heavily overrepresented, its partition can become a bottleneck. Below are strategies, ordered from most to least recommended.</p><h3>1. 
Adaptive Query Execution (AQE) &#8211; Automatic Skew Handling</h3><p>Spark 3.0+ introduced <strong>Adaptive Query Execution (AQE)</strong>, which can dynamically detect and correct skewed partitions during runtime. When AQE is enabled, Spark measures the size of each shuffle partition after the initial shuffle. If it finds any partition that is both exceptionally large in absolute terms and multiple times larger than the median partition size, it automatically splits that partition into smaller sub-tasks and replicates the corresponding rows from the other side of the join so each sub-task can run independently.</p><h4>How It Works</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N7Vj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N7Vj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png 424w, https://substackcdn.com/image/fetch/$s_!N7Vj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png 848w, https://substackcdn.com/image/fetch/$s_!N7Vj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png 1272w, https://substackcdn.com/image/fetch/$s_!N7Vj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N7Vj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png" width="536" height="1159" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/051f8937-4934-4d8e-929f-71c4bf2a6d48_536x1159.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1159,&quot;width&quot;:536,&quot;resizeWidth&quot;:536,&quot;bytes&quot;:110917,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/165303261?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051f8937-4934-4d8e-929f-71c4bf2a6d48_536x1159.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N7Vj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png 424w, https://substackcdn.com/image/fetch/$s_!N7Vj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png 848w, https://substackcdn.com/image/fetch/$s_!N7Vj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png 1272w, 
https://substackcdn.com/image/fetch/$s_!N7Vj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><ol><li><p><strong>Collect Partition Statistics:</strong></p><ul><li><p>After the shuffle phase, Spark records the size (bytes) of every partition on both sides of the join.</p></li></ul></li><li><p><strong>Identify Skewed Partitions:</strong><br>A partition is marked as &#8220;skewed&#8221; only if it meets <strong>both</strong> 
criteria:</p><ul><li><p><strong>Absolute&#8208;Size Threshold: </strong><code>spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes </code>Default: <code>256MB</code></p></li><li><p><strong>Relative&#8208;Size Factor: </strong><code>spark.sql.adaptive.skewJoin.skewedPartitionFactor</code></p><p>(Default: <code>5.0</code>)</p></li></ul><p>If the median shuffle&#8208;partition size is 50 MB, a factor of 5.0 means any partition &gt; 250 MB qualifies&#8212;provided it also exceeds the 256 MB absolute threshold.</p></li><li><p><strong>Split &amp; Replicate:</strong></p><ul><li><p>Suppose partition #17 is 1 GB and the coalesced&#8208;partition target is 250 MB. Spark divides that 1 GB into four ~250 MB sub-partitions.</p></li><li><p>For a join, each of those sub-partitions must still see all matching rows from the opposite dataset. Spark duplicates those matching rows N times (once per sub-partition) so each sub-task can run a local join.</p></li></ul></li><li><p><strong>Run Subtasks in Parallel &amp; Merge Results:</strong></p><ul><li><p>Instead of a single, massive task pulling 1 GB, Spark launches N tasks (e.g., four tasks pulling ~250 MB each plus replicated rows).</p></li><li><p>When those sub-tasks finish, Spark concatenates their outputs to produce the final joined result.</p></li></ul></li></ol><p>Because this splitting and replication occur <strong>after</strong> the initial shuffle&#8212;when Spark has accurate sizes&#8212;no query rewriting or manual &#8220;hints&#8221; are required.</p><h4>Configuration</h4><pre><code># Enable AQE (on by default in Spark 3.2+)
spark.sql.adaptive.enabled=true

# Enable skew-join correction
spark.sql.adaptive.skewJoin.enabled=true

# Absolute-size threshold for skewed partitions
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB

# Relative-size factor: if a partition is &gt; factor &#215; median size, it's skewed
spark.sql.adaptive.skewJoin.skewedPartitionFactor=5.0

# (Spark 3.3+) Force AQE to apply skew-join splitting even if it adds shuffle overhead
spark.sql.adaptive.forceOptimizeSkewedJoin=true
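
# Related knob (shown with its default; tune for your workload): AQE's
# advisory partition size is also the target chunk size used when a
# skewed partition is split
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.adaptive.advisoryPartitionSizeInBytes=64MB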
</code></pre><h4>Pros &amp; Cons</h4><ul><li><p><strong>Pros:</strong></p><ul><li><p><strong>Zero code changes</strong>: No query rewrites, no manual hints.</p></li><li><p><strong>Runtime intelligence</strong>: Works on any sort-merge or shuffle-hash join where skew is severe.</p></li><li><p>Eliminates straggler tasks without requiring you to identify skewed keys in advance.</p></li></ul></li><li><p><strong>Cons:</strong></p><ul><li><p>Applies only to <strong>shuffle joins</strong> (sort-merge and shuffle-hash). Broadcast joins never shuffle, so they aren&#8217;t &#8220;skewed.&#8221;</p></li><li><p>Splitting and replicating can introduce extra shuffle I/O; mild skew might not trigger splitting, or the split might not pay off.</p></li><li><p>You may need to tune thresholds (<code>skewedPartitionThresholdInBytes</code> and <code>skewedPartitionFactor</code>) to avoid splitting on nearly-skewed partitions.</p></li></ul></li></ul><div><hr></div><h3>2. Broadcast Hash Join (Small&#8211;Large Optimization)</h3><p>If one side of a join is small enough to fit in memory on every executor, a <strong>broadcast hash join</strong> eliminates virtually all skew risk. By broadcasting the smaller dataset to every executor, Spark can join on the large side without shuffling it by key. Even a &#8220;hot&#8221; key on the large side is processed in parallel across many tasks, because each task already has the complete, in-memory copy of the smaller table.</p><h4>How It Works</h4><ol><li><p><strong>Spark Optimizer Picks It Automatically</strong> (if small side &#8804; 10 MB by default):</p><ul><li><p>Controlled by:</p><p><code>spark.sql.autoBroadcastJoinThreshold </code>(Default: <code>10MB</code>)</p></li><li><p>Raise this value to allow broadcasting larger tables, but in practice keep it well below 1 GB.</p></li></ul></li><li><p><strong>Explicitly Force Broadcast in DataFrame Code:</strong></p></li></ol><pre><code>from pyspark.sql.functions import broadcast

result = largeDF.join(broadcast(smallDF), "joinKey")</code></pre><ol start="3"><li><p><strong>Spark SQL Hint:</strong></p></li></ol><pre><code>SELECT /*+ BROADCAST(s) */ *
FROM large l
JOIN small s
  ON l.joinKey = s.joinKey;</code></pre><p>Since the large dataset is not shuffled by key, no single reducer processes all rows for a heavy key. Instead, each task hashes the broadcasted small side in-memory, and streams its assigned partitions of the large side through that hash.</p><h3>Pros &amp; Cons</h3><ul><li><p><strong>Pros:</strong></p><ul><li><p><strong>No shuffle</strong> on large side&#8212;completely eliminates skew related to the small side.</p></li><li><p>Simple to implement via <code>broadcast()</code> hints or by tuning <code>spark.sql.autoBroadcastJoinThreshold</code>.</p></li><li><p>Dramatic speedups when one side is truly small and the other side has a hot key.</p></li></ul></li><li><p><strong>Cons:</strong></p><ul><li><p>The &#8220;small&#8221; table must <strong>fit comfortably</strong> in each executor&#8217;s memory. If it&#8217;s too large (hundreds of MB), broadcasting can create memory pressure or OOM.</p></li><li><p>Not applicable when <strong>both</strong> sides are large.</p></li><li><p>Total cluster memory usage for the small table = (# executors) &#215; (size of small table).</p></li></ul></li></ul><div><hr></div><h3>3. 
Handling Skewed Keys Separately (Divide &amp; Conquer)</h3><p>If you know exactly which key(s) are skewed, you can <strong>split your data into two subsets</strong>&#8212;the skewed-key subset and the &#8220;rest&#8221;&#8212;process them separately, then recombine</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IBuf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IBuf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png 424w, https://substackcdn.com/image/fetch/$s_!IBuf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png 848w, https://substackcdn.com/image/fetch/$s_!IBuf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png 1272w, https://substackcdn.com/image/fetch/$s_!IBuf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IBuf!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png" width="1200" height="1356.5217391304348" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f95c396b-5af1-4412-867e-41e02383a572_1035x1170.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f9ea8026-d737-41ec-bed7-fa8b1716e8b6_1035x1170.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1170,&quot;width&quot;:1035,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:129951,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/165303261?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ea8026-d737-41ec-bed7-fa8b1716e8b6_1035x1170.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IBuf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png 424w, https://substackcdn.com/image/fetch/$s_!IBuf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png 848w, https://substackcdn.com/image/fetch/$s_!IBuf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png 1272w, https://substackcdn.com/image/fetch/$s_!IBuf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex 
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>How It Works</h4><ol><li><p><strong>Split Each Dataset into &#8220;Skewed&#8221; vs. &#8220;Rest&#8221;:</strong></p></li></ol><pre><code>skewed_keys = ["USA"]

import pyspark.sql.functions as F

# Dataset A (large or small, doesn&#8217;t matter)
A_skew   = A.filter(F.col("country").isin(skewed_keys))
A_rest   = A.filter(~F.col("country").isin(skewed_keys))

# Dataset B
B_skew   = B.filter(F.col("country").isin(skewed_keys))
B_rest   = B.filter(~F.col("country").isin(skewed_keys))
</code></pre><ol start="2"><li><p><strong>Join the &#8220;Rest&#8221; Subsets Normally:</strong></p></li></ol><pre><code>main_join = A_rest.join(B_rest, "country")</code></pre><p>Since &#8220;USA&#8221; is removed, these partitions will be balanced&#8212;assuming no other keys are extremely skewed.</p><ol start="3"><li><p><strong>Join the &#8220;Skewed&#8221; Subsets Separately with an Optimized Strategy:</strong></p></li></ol><ul><li><p>If <code>B_skew</code> is small enough, <strong>broadcast</strong> it:</p></li></ul><pre><code>skew_join = A_skew.join(broadcast(B_skew), "country")</code></pre><ul><li><p>Otherwise, you could <strong>salt</strong> only the &#8220;USA&#8221; key (as described in the salting technique below) or use any other technique.</p></li></ul><ol start="4"><li><p><strong>Union the Two Results:</strong></p></li></ol><pre><code>final_result = main_join.unionByName(skew_join)</code></pre><h4>Pros &amp; Cons</h4><ul><li><p><strong>Pros:</strong></p><ul><li><p><strong>Simplicity</strong>: Process the skewed key in isolation; non-skewed data is untouched.</p></li><li><p>You choose exactly how to handle the problematic key (e.g., broadcast, salt, or extra resources).</p></li><li><p>No need to change logic for the majority of keys.</p></li></ul></li><li><p><strong>Cons:</strong></p><ul><li><p>Requires an extra read/scan (filter) on each dataset&#8212;though filter is usually cheap.</p></li><li><p>Increases job complexity: two join operations instead of one.</p></li><li><p>If more than one key is skewed, you must repeat this process for each key or group of keys&#8212;and each subset can still contain skew of its own.</p></li><li><p>Must identify skewed key(s) beforehand.</p></li></ul></li></ul><h3>4. 
<strong>Salting Every Key (Uniform Distribution Across N Buckets)</strong></h3><p>In real-world joins&#8212;especially at scale&#8212;any single key with an extremely high row count (for example, a superstar YouTuber like &#8220;mr_beast&#8221;) can overwhelm one partition, leading to severe performance bottlenecks. While you might compensate by detecting and salting just that one &#8220;hot&#8221; key, a more robust approach is to uniformly salt every <code>youtuber_id</code>, ensuring that even unexpected popularity spikes are handled gracefully. By applying a deterministic salt to all keys, each <code>youtuber_id</code> is augmented with a bucket index, distributing its rows across up to N partitions. Matching rows from both tables still join correctly because the salt is derived deterministically from the join key (and potentially another column like <code>video_id</code>).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jIDC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jIDC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png 424w, https://substackcdn.com/image/fetch/$s_!jIDC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png 848w, https://substackcdn.com/image/fetch/$s_!jIDC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png 
1272w, https://substackcdn.com/image/fetch/$s_!jIDC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jIDC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png" width="513" height="1187" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82c2e15d-e050-4746-bdf0-9b36a62214f9_513x1187.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1187,&quot;width&quot;:513,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88122,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/165303261?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82c2e15d-e050-4746-bdf0-9b36a62214f9_513x1187.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jIDC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png 424w, https://substackcdn.com/image/fetch/$s_!jIDC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png 848w, 
https://substackcdn.com/image/fetch/$s_!jIDC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png 1272w, https://substackcdn.com/image/fetch/$s_!jIDC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h4>How It Works</h4><ol><li><p><strong>Choose a Salt Count (N)</strong></p><ul><li><p>Decide how many buckets to split <strong>every</strong> <code>youtuber_id</code> into (for example, <code>N = 10</code>).</p></li><li><p>Aim for each salted partition to be on the order of 100&#8211;300 MB (or your target). Use the Spark UI&#8217;s &#8220;Shuffle Read Size by Task&#8221; to gauge ideal bucket size.</p></li></ul></li><li><p><strong>Compute a Deterministic Salt for Each Row</strong></p><ul><li><p>For each row, compute:</p></li></ul></li></ol><pre><code>salt = abs(hash(concat(youtuber_id, video_id))) % N
salted_youtuber = CONCAT(youtuber_id, "_", salt)</code></pre><p>This ensures:</p><ul><li><p><strong>All rows belonging to the same (youtuber_id, video_id)</strong> produce the same <code>(salted_youtuber, video_id)</code> pair in both tables.</p></li><li><p><strong>Every youtuber_id is split</strong> across up to N buckets&#8212;popular keys will spread widely, less-popular keys may cluster in fewer buckets if they have fewer distinct <code>video_id</code> values.</p></li></ul><p><strong>3. Salt Both Tables in PySpark</strong></p><pre><code>import pyspark.sql.functions as F

N = 10

def saltAllExpr(yid_col, vid_col):
    """
    Deterministic salt for every (youtuber_id, video_id):
    salted_youtuber = youtuber_id + "_" + (abs(hash(youtuber_id || video_id)) % N)
    """
    return F.concat(
        yid_col,
        F.lit("_"),
        (F.abs(F.hash(F.concat(yid_col, vid_col))) % N).cast("string")
    )

# Salt the IMPRESSIONS table
salted_impressions = impressions.withColumn(
    "salted_youtuber",
    saltAllExpr(F.col("youtuber_id"), F.col("video_id"))
)

# Salt the CLICKS table
salted_clicks = clicks.withColumn(
    "salted_youtuber",
    saltAllExpr(F.col("youtuber_id"), F.col("video_id"))
)
</code></pre><ul><li><p>Every <code>(youtuber_id, video_id)</code> pair gets a consistent bucket index in <code>[0..9]</code>.</p></li><li><p>For <code>"mr_beast"</code> with <code>video_id = "abc123"</code>, <code>salted_youtuber = "mr_beast_4"</code> (for example).</p></li><li><p>A different video <code>"xyz789"</code> might map to <code>"mr_beast_7"</code>.</p></li><li><p>A less-popular youtuber with only one or two videos may occupy only 1&#8211;2 buckets&#8212;but that&#8217;s fine.</p></li></ul><ol start="4"><li><p><strong>Perform the Salted Join</strong></p></li></ol><pre><code>joined = salted_impressions.alias("imp").join(
    salted_clicks.alias("clk"),
    on=["salted_youtuber", "video_id"],
    how="inner"
)
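
# The salt has served its purpose once the join completes; drop the
# helper column so only the original keys flow downstream.
joined = joined.drop("salted_youtuber")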
</code></pre><ul><li><p><strong>Before</strong> salting: All <code>"mr_beast"</code> rows (across any <code>video_id</code>) would land in a single partition.</p></li><li><p><strong>After</strong> salting: Each distinct <code>(youtuber_id, video_id)</code> combination goes to a bucket <code>youtuber_id_&lt;0..9&gt;</code>, so <code>"mr_beast"</code> content spreads across up to 10 partitions&#8212;one per bucket index.</p></li><li><p>This eliminates a single &#8220;hot&#8221; partition for <code>"mr_beast"</code>.</p></li></ul><ol start="5"><li><p><strong>Spark SQL Equivalent</strong></p></li></ol><pre><code>WITH salted_impressions AS (
  SELECT
    *,
    CONCAT(
      youtuber_id,
      '_',
      CAST(ABS(hash(CONCAT(youtuber_id, video_id))) % 10 AS STRING)
    ) AS salted_youtuber
  FROM impressions
),
salted_clicks AS (
  SELECT
    *,
    CONCAT(
      youtuber_id,
      '_',
      CAST(ABS(hash(CONCAT(youtuber_id, video_id))) % 10 AS STRING)
    ) AS salted_youtuber
  FROM clicks
)
SELECT
  imp.*,
  clk.viewer_id,
  clk.timestamp AS click_timestamp
FROM salted_impressions imp
JOIN salted_clicks clk
  ON imp.salted_youtuber = clk.salted_youtuber
 AND imp.video_id       = clk.video_id;
</code></pre><ul><li><p>Each <code>(youtuber_id, video_id)</code> deterministically maps to one of 10 buckets.</p></li><li><p>Even if <code>"mr_beast"</code> has 100 videos, those 100 distinct <code>(youtuber_id, video_id)</code> pairs spread across up to 10 buckets.</p></li></ul><div><hr></div><h4>Pros &amp; Cons</h4><p><strong>Pros:</strong></p><ul><li><p><strong>Uniform Distribution for All Keys</strong><br>Any youtuber with many videos&#8212;like <code>"mr_beast"</code>&#8212;will spread its rows across N buckets.</p></li><li><p><strong>No Conditional Logic on &#8220;Hot&#8221; Keys</strong><br>You don&#8217;t need to first identify which youtuber is skewed; every key is salted uniformly.</p></li><li><p><strong>Deterministic</strong><br>Matching <code>(youtuber_id, video_id)</code> pairs always end up in the same bucket on both sides, so joins remain correct.</p></li><li><p><strong>Works for Any Join</strong><br>Applies whether one or both tables are large&#8212;no reliance on broadcast.</p></li></ul><p><strong>Cons:</strong></p><ul><li><p><strong>Extra Shuffle Volume</strong><br>Every row in both tables carries an extra salted key, and all rows must shuffle by <code>(salted_youtuber, video_id)</code>.</p><ul><li><p>If a youtuber is lightly used, its rows may end up in only one or two buckets&#8212;but they still shuffle.</p></li><li><p>If the data was fairly balanced originally, salting &#8220;everything&#8221; may introduce more shuffle than strictly necessary.</p></li></ul></li><li><p><strong>Choosing the Right N Is Crucial</strong></p><ul><li><p>If N is too small, heavily skewed keys (like <code>"mr_beast"</code>) still concentrate too much data in one bucket.</p></li><li><p>If N is too large, you create many small partitions, which increases scheduler overhead.</p></li></ul></li><li><p><strong>Need to Drop </strong><code>salted_youtuber</code><strong> After the Join</strong><br>If you only care about the original key (<code>youtuber_id</code>), drop
<code>salted_youtuber</code> once the join is done.</p></li></ul><div><hr></div><h4>When to Use &#8220;Salt Everything&#8221;</h4><p>Use this approach when:</p><ul><li><p><strong>You don&#8217;t know in advance</strong> which keys will be skewed (e.g., an Uber driver of the week suddenly goes viral, or any youtuber&#8217;s popularity spikes).</p></li><li><p><strong>Data volume is large and dynamic</strong>, and you want a one&#8208;size&#8208;fits&#8208;all solution rather than conditionally checking for hot keys.</p></li><li><p><strong>You want consistent distribution</strong> for all <code>(youtuber_id, video_id)</code> pairs without maintaining a list of skewed keys.</p></li></ul><div><hr></div><h2>Additional Considerations (ideally, try to avoid needing these)</h2><ul><li><p><strong>Tuning Shuffle Partitions</strong></p></li></ul><p>Adjust <code>spark.sql.shuffle.partitions</code> to a value higher than the default (200), ideally a few times your cluster&#8217;s total cores, so that partitions remain small. Too many partitions cause scheduler overhead; too few cause each partition to be large.</p><ul><li><p><strong>Speculative Execution:</strong> Enabling speculation (<code>spark.speculation=true</code>) can alleviate the impact of skew by attempting to re-run straggling tasks on another executor. This doesn&#8217;t fix the skew itself, but if a task is slow (perhaps due to skew or maybe a slow node), Spark will launch a duplicate task elsewhere. Whichever finishes first wins. In a skew scenario, a speculated task is still doing the same heavy work, so it won&#8217;t magically complete faster unless the original executor was anomalously slow. However, speculation can sometimes help if, say, one executor was busy with garbage collection while another could do the work faster &#8211; it provides a safety net for stragglers.
It&#8217;s generally good to enable in large clusters, but note it causes extra resource usage for those duplicate tasks.</p></li><li><p><strong>Monitoring with the Spark UI</strong></p><ul><li><p>In the <strong>Stages</strong> tab, expand a SQL stage and click <strong>Physical Plan</strong>.</p></li><li><p>Under <strong>Shuffle Read Size by Task</strong>, look for a single bar that towers over the others&#8212;that&#8217;s your skewed partition.</p></li><li><p>Use those insights to decide between AQE and manual salting.</p></li></ul></li><li><p><strong>Filtering Out Problematic Rows</strong></p><ul><li><p>If certain values (e.g., <code>NULL</code> or outliers) cause extreme skew but are not essential, you can drop them before the join. Only do this if you can accept losing those rows from the result.</p></li></ul></li></ul><pre><code>cleanedDF = originalDF.filter(F.col("country").isNotNull())</code></pre><ul><li><p><strong>Use Skew Hints (Spark 3.4+)</strong></p><ul><li><p>You can annotate specific keys as skewed in a Spark SQL query so that Spark generates a plan that avoids shuffling them into a single reducer.</p></li></ul></li><li><p><strong>Memory and Shuffle Tuning:</strong> While not fixing skew, you might need to adjust memory configs to handle it. For instance, if one partition is huge, increasing executor memory or shuffle buffer sizes (<code>spark.shuffle.spill.numElementsForceSpillThreshold</code>, <code>spark.shuffle.file.buffer</code>, etc.) won&#8217;t solve the skew but might prevent OOM crashes by allowing Spark to spill gracefully. Similarly, ensure <code>spark.memory.fraction</code> or <code>spark.sql.autoBroadcastJoinThreshold</code> are set such that the heavy data can be handled (e.g., give more memory to shuffle if needed). These are more about coping with skew than removing it.</p></li><li><p><strong>Adaptive Query Execution (AQE):</strong> As discussed, ensure <code>spark.sql.adaptive.enabled=true</code> (should be default on modern Spark) and <code>spark.sql.adaptive.skewJoin.enabled=true</code>.
You can adjust <code>spark.sql.adaptive.skewJoin.skewedPartitionFactor</code> (default 5) and <code>...skewedPartitionThresholdInBytes</code> (default 256MB) to tune how aggressively Spark flags partitions as skewed. Lowering these values makes Spark split smaller skews, but setting them too low might cause unnecessary splitting. In Spark 3.3+, if you really want to force skew join handling, <code>spark.sql.adaptive.forceOptimizeSkewedJoin=true</code> will apply the optimization even if it might add extra shuffle overhead.</p></li></ul><div><hr></div><h2>Tackling Skew in Spark Aggregations: From Simple Sums to Semi-Additive Metrics</h2><p>Aggregation operations like <code>groupBy().agg()</code> in Spark can become major performance bottlenecks when data is skewed. A small number of hot, high-frequency keys can result in uneven workload distribution, where one reducer is overloaded while others remain idle. While Spark's map-side partial aggregation helps, it alone can&#8217;t prevent reducers from becoming overwhelmed when skewed keys funnel massive data into single tasks.</p><p>In this deep dive, we&#8217;ll explore practical patterns to mitigate skew during aggregations, especially focusing on semi-additive metrics like averages, distinct counts, and ratios&#8212;metrics that can't always be merged as trivially as sums or counts.</p><div><hr></div><h3>1. Two-Stage Aggregation with Salting</h3><p>The most effective method for aggregation skew is a two-stage salted aggregation. In the first stage, you add a salt (random or deterministic) to the key, distributing rows across more groups.
In the second stage, you aggregate these partials back to the original key.</p><h4>How It Works:</h4><ul><li><p>Add a new column (e.g., <code>salt = floor(rand() * N)</code>) to the grouping key</p></li><li><p>Group by <code>(key, salt)</code> and compute partial aggregates</p></li><li><p>Re-group by <code>key</code> to merge the partials</p></li></ul><h4>PySpark Example:</h4><pre><code><code>from pyspark.sql.functions import col, rand, floor, sum as _sum, count as _count

N = 10
salted_df = df.withColumn("salt", floor(rand() * N))

# First stage: partial aggregation
partial = salted_df.groupBy("key", "salt").agg(
    _sum("value").alias("partial_sum"),
    _count("value").alias("partial_count")
)

# Second stage: final aggregation
final = partial.groupBy("key").agg(
    _sum("partial_sum").alias("total_sum"),
    _sum("partial_count").alias("total_count")
)</code></code></pre><p>This works well for <strong>semi-additive metrics</strong> like average:</p><pre><code><code>final.withColumn("avg", col("total_sum") / col("total_count"))</code></code></pre><h4>Pros:</h4><ul><li><p>Greatly reduces skew on hot keys</p></li><li><p>Flexible: works for sums, counts, averages, etc.</p></li></ul><h4>Cons:</h4><ul><li><p>Not directly applicable to non-associative metrics (like median, percentile)</p></li><li><p>Requires an extra stage of aggregation and data shuffle</p></li><li><p>You must choose N carefully</p></li></ul><div><hr></div><h3>2. Favor Combiner-Friendly DataFrame Operations</h3><p>In the DataFrame API, Spark automatically performs map-side combine for aggregation functions like <code>sum</code>, <code>count</code>, and <code>avg</code>. This significantly reduces data shuffled across the network.</p><h4>Best Practices:</h4><ul><li><p>Avoid collecting all values per key using <code>collect_list</code> or <code>collect_set</code> unless needed</p></li><li><p>Prefer built-in aggregation functions that support partial aggregation</p></li></ul><h4>Example:</h4><pre><code><code>df.groupBy("user_id").agg(
    _sum("impressions").alias("total_impressions"),
    _count("clicks").alias("click_count")
)</code></code></pre><p>This automatically benefits from map-side combine.</p><div><hr></div><h3>3. Hierarchical or Incremental Aggregation</h3><p>Instead of grouping by the final key directly, first group on a <strong>compound key</strong> (e.g., key + day), then roll up to the main key. This acts like salting but uses a meaningful secondary attribute.</p><p>Example: Group by <code>(customer_id, date)</code>, then group again by <code>customer_id</code>.</p><h4>Pros:</h4><ul><li><p>Uses natural structure in data</p></li><li><p>More interpretable than random salt</p></li></ul><h4>Cons:</h4><ul><li><p>Only works if meaningful secondary keys exist</p></li><li><p>Adds complexity to query logic</p></li></ul><div><hr></div><h3>4. Isolate Skewed Keys</h3><p>When just a few keys are skewed (e.g., "mr_beast" on YouTube), isolate them:</p><ul><li><p>Filter the skewed keys</p></li><li><p>Aggregate them separately</p></li><li><p>Aggregate the rest normally</p></li><li><p>Union results</p></li></ul><h4>Pros:</h4><ul><li><p>Simple logic for non-skewed keys</p></li><li><p>You can fine-tune treatment of skewed keys</p></li></ul><h4>Cons:</h4><ul><li><p>Manual, doesn&#8217;t scale to many skewed keys</p></li><li><p>Separate logic paths = more complexity</p></li></ul><div><hr></div><h3>Special Note: Semi-Additive Metrics</h3><p>For metrics like <strong>averages</strong>, <strong>ratios</strong>, or <strong>distinct counts</strong>, special care is needed:</p><ul><li><p><strong>Average:</strong> Use partial sums and counts, then divide</p></li><li><p><strong>Ratios:</strong> Keep numerator/denominator separate, aggregate both, then divide</p></li><li><p><strong>Count Distinct:</strong> Use <code>approx_count_distinct()</code> for scalable approximations</p></li></ul><p>Some metrics cannot be split and recombined (e.g., exact percentiles). 
In those cases, use isolation or rethink the need for exact aggregation.</p><div><hr></div><h3>Final Thoughts for Aggregates</h3><p>Aggregation skew is an invisible killer in Spark jobs. The best strategy is proactive design: salt heavy keys, use partial aggregation, and always choose APIs that favor combiners. With these patterns, even semi-additive or tricky metrics can be made scalable at massive volumes.</p><p>If you're dealing with skew, don't just throw resources at it. Design for it.</p><h2>Summary of Recommendations for Joins</h2><ul><li><p><strong>Adaptive Query Execution (AQE)</strong> (recommended first): Zero code changes, runtime splitting for any sort-merge or shuffle-hash join.</p></li><li><p><strong>Broadcast Hash Join</strong></p><ul><li><p><strong>When one side is small</strong> (&#8804; 10 MB by default; the threshold can be raised toward 1 GB). Hint in DataFrame or SQL.</p></li><li><p>Avoids all skew because no shuffle on the small side.</p></li></ul></li><li><p><strong>Salting the Key</strong></p><ul><li><p><strong>When neither side is small</strong>, but you know exactly which key(s) dominate.</p></li><li><p>Manual, but guaranteed to split a hot key across N partitions.</p></li></ul></li><li><p><strong>Handle Skewed Keys Separately</strong></p><ul><li><p><strong>When you can isolate a small number of skewed keys</strong>.</p></li><li><p>Split data into &#8220;skewed&#8221; vs. &#8220;rest&#8221;; optimize skewed subset, then union.</p></li></ul></li></ul><p>By applying these strategies in order&#8212;starting with AQE&#8217;s automatic handling, then broadcasting small tables, and, if necessary, resorting to manual salting or custom partitioning&#8212;you can eliminate or dramatically reduce skew-related stragglers in your Spark join operations.
Choose the approach that best fits your cluster&#8217;s Spark version, data volume, and the complexity you&#8217;re willing to maintain.</p><p></p><p></p><h2>Architectural Patterns and Data Design to Reduce Skew</h2><p>Beyond individual Spark jobs, you can sometimes address skew at the <strong>data architecture level</strong> to prevent issues before they happen:</p><ul><li><p><strong>Skew-Aware Data Partitioning:</strong> As discussed, designing how data is partitioned or bucketed in storage can reduce skew. For example, if you frequently group or join by a key that&#8217;s skewed, consider storing the data partitioned by that key <em>and a secondary split</em>. A real-world practice: if one category of data is 90% of the dataset, you might partition that category&#8217;s data further by another field. Essentially, <strong>acknowledge the skewed key in your data model</strong> and subdivide it. This could mean separate tables or partitions for heavy categories. When you process the data, you then handle those partitions in parallel. The benefit is you're not repeatedly shuffling the entire dataset to discover the same skew; you&#8217;ve pre-divided it.</p></li><li><p><strong>Pre-Aggregation / Summaries:</strong> If your use-case allows, maintain rolling aggregates for skewed keys. For instance, if one user has a million events per day and you always compute their daily total, consider updating a running total for that user in a database or a separate file, rather than recomputing from scratch in each Spark job. By reducing the raw data volume for that key through prior aggregation, you avoid the huge shuffle for that key at query time. This is applicable in pipelines where data is appended incrementally (common in streaming or daily ETL). You trade off storage (keeping summary data) for performance.</p></li><li><p><strong>Alternate Algorithms:</strong> In some cases, you might choose a different approach entirely. 
For example, for a skewed distinct count, using an approximate algorithm (like HyperLogLog) per partition can avoid bringing all data together. Or using Bloom filters to reduce data before join (filter out records that won&#8217;t match). These are specific to certain problems but can mitigate skew by cutting down the data processed.</p></li><li><p><strong>Scaling Up Hot Data Separately:</strong> This is more of an infrastructure pattern &#8211; if one key&#8217;s data is massive, you could route that to a specialized system. For instance, maybe that one key corresponds to a particular customer &#8211; you could give them their own dedicated processing or database, and exclude those records from the general Spark workflow. It&#8217;s an extreme solution, but sometimes separating concerns (multi-tenancy isolation) helps if one tenant&#8217;s data skews the whole system.</p></li><li><p><strong>Monitoring and Iteration:</strong> A softer &#8220;pattern&#8221; is to continuously monitor your Spark job metrics (especially in Spark UI or via logs) to catch skew issues and then adjust. Over time, you may adapt your data ingestion or job logic to handle new skewed keys as data grows. For example, if a new user becomes a power user, you might add them to the &#8220;skewed key list&#8221; for salting. In practice, skew patterns can change, so an architecture that can adjust (or a code path that can automatically detect top N heavy keys and treat them differently) can be very useful.</p></li></ul><p>In essence, architectural approaches are all about <strong>not putting all eggs in one basket</strong> &#8211; distribute data smartly from the ground up, and treat the outliers with special care. 
This reduces the burden on any single Spark job to handle an immense skew on the fly.</p><h2>References</h2><ul><li><p><a href="https://docs.databricks.com/aws/en/optimizations/aqe#dynamically-handle-skew-join">https://docs.databricks.com/aws/en/optimizations/aqe#dynamically-handle-skew-join</a></p></li><li><p><a href="https://medium.com/@suffyan.asad1/handling-data-skew-in-apache-spark-techniques-tips-and-tricks-to-improve-performance-e2934b00b021">https://medium.com/@suffyan.asad1/handling-data-skew-in-apache-spark-techniques-tips-and-tricks-to-improve-performance-e2934b00b021</a></p></li><li><p><a href="https://www.databricks.com/discover/pages/optimize-data-workloads-guide">https://www.databricks.com/discover/pages/optimize-data-workloads-guide</a></p></li><li><p><a href="https://www.dataengi.com/post/2019/02/06/spark-data-skew-problem/#:~:text=We%20can%20reduce%20data%20skew,impact%20of%20data%20skew%20before">https://www.dataengi.com/post/2019/02/06/spark-data-skew-problem/#:~:text=We%20can%20reduce%20data%20skew,impact%20of%20data%20skew%20before</a></p></li><li><p><a href="https://spark.apache.org/docs/3.5.3/sql-performance-tuning.html#:~:text=,3.0.0">https://spark.apache.org/docs/3.5.3/sql-performance-tuning.html#:~:text=,3.0.0</a></p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.canadiandataguy.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading CanadianDataGuy&#8217;s No Fluff Newsletter! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Why Your PySpark UDF Is Slowing Everything Down]]></title><description><![CDATA[An in-depth exploration of architecture, execution flow, bottlenecks, and optimization strategies for PySpark UDFs]]></description><link>https://www.canadiandataguy.com/p/why-your-pyspark-udf-is-slowing-everything</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/why-your-pyspark-udf-is-slowing-everything</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Thu, 24 Apr 2025 22:39:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!21cQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>1. Introduction</h2><p>PySpark&#8217;s User Defined Functions (UDFs) empower developers to inject custom Python logic into Spark DataFrames. They feel like a convenient escape hatch when built-in SQL functions don&#8217;t cut it. However, under the hood, each UDF invocation triggers a complex ballet of inter-process communication, serialization, and single-threaded Python loops. This blog peels back each layer of that architecture to reveal why PySpark UDFs can become a massive performance drain &#8212; and then walks through concrete alternatives and optimizations to keep your jobs blazing fast.</p><div><hr></div><h2>2. 
The Problem with PySpark UDFs</h2><p>When you sprinkle UDF calls across your Spark SQL or DataFrame pipeline, you&#8217;re effectively handing off portions of your query plan to a &#8220;black box&#8221; Python function. That comes at a steep cost:</p><h3>2.1 Catalyst Optimizer Becomes Blind</h3><ul><li><p><strong>No predicate pushdown:</strong> Spark&#8217;s Catalyst optimizer can&#8217;t inspect or reorder the logic inside your UDF, so it abandons optimizations like pushing filters down to data sources.</p></li><li><p><strong>No whole-stage code generation:</strong> The code-gen engine can&#8217;t fuse your UDF into JVM bytecode, so you lose out on compiler-level speed gains.</p></li></ul><h3>2.2 Serialization/Deserialization Overhead</h3><ul><li><p><strong>Row-by-row data shuffling:</strong> Each row must be marshalled from the JVM heap into a Python object, sent over a local socket, then converted back. After your Python code runs, the result takes the reverse path back into the JVM.</p></li><li><p><strong>Millions of crossings:</strong> With millions (or even billions) of rows, that boundary-crossing cost balloons.</p></li></ul><h3>2.3 Single-Threaded Python Execution</h3><ul><li><p><strong>Global Interpreter Lock (GIL):</strong> Your UDF runs in a standard CPython process under a single core. All per-row work happens sequentially inside the UDF.</p></li></ul><h3>2.4 Memory and Stability Risks</h3><ul><li><p><strong>Python OOMs:</strong> Unlike JVM operations, Spark doesn&#8217;t manage Python worker memory. Processing large batches can crash with out-of-memory errors.</p></li><li><p><strong>Uncaught exceptions:</strong> A bug in your UDF can fail an entire Spark task. Null handling, pickling errors, and non-serializable closures often catch teams by surprise.</p></li></ul><div><hr></div><h2>3.
Under the Hood: PySpark&#8217;s Dual-Runtime Architecture</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!21cQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!21cQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png 424w, https://substackcdn.com/image/fetch/$s_!21cQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png 848w, https://substackcdn.com/image/fetch/$s_!21cQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png 1272w, https://substackcdn.com/image/fetch/$s_!21cQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!21cQ!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png" width="1200" height="532.4175824175824" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62cbed82-2c4b-48a4-b0c0-5d53d50e2737_3840x1705.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:646,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:282499,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/162008971?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62cbed82-2c4b-48a4-b0c0-5d53d50e2737_3840x1705.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!21cQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png 424w, https://substackcdn.com/image/fetch/$s_!21cQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png 848w, https://substackcdn.com/image/fetch/$s_!21cQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png 1272w, https://substackcdn.com/image/fetch/$s_!21cQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Py4J is a communication bridge/library that lets Python and Java interoperate by exchanging objects over sockets. In Spark, it powers two key workflows: setting up the Python <code>SparkContext</code> and converting data types in PySpark SQL. When you start a PySpark session, Py4J opens a socket connection between your Python driver and the underlying Java driver. Later, whenever Spark SQL operations run, Py4J translates Python types into their Java equivalents (and back) so the Python API can seamlessly drive the JVM-based SQL engine. 
Under the hood, every Python UDF invocation follows this path:</p><pre><code>Python Driver &#8594; SparkContext &#8594; Py4J &#8594; JVM &#8594; JavaSparkContext  </code></pre><p>Because each UDF call must cross this socket boundary, it adds measurable latency to your job.</p><h3>3.1 Py4J: Bridging Python and the JVM</h3><p>At startup, PySpark uses <a href="https://www.py4j.org/">Py4J</a> to:</p><ol><li><p><strong>Connect the Python driver to the JVM driver.</strong></p></li><li><p><strong>Translate data types</strong> between Python and Java during SQL operations and UDF calls.</p></li></ol><p>Every call into Spark SQL or a UDF crosses this bridge &#8212; think of it as a high-latency tunnel for each record.</p><h3>3.2 Driver, Executors, and Python Workers</h3><ol><li><p><strong>Driver (Python process):</strong> You call <code>df.withColumn("foo", my_udf(col("bar")))</code>.</p></li><li><p><strong>JVM Driver:</strong> Receives the UDF registration, plans the query.</p></li><li><p><strong>Executor JVMs:</strong> Spin up separate Python subprocesses per task.</p></li><li><p><strong>Python Workers:</strong> Handle the actual UDF logic on deserialized batches.</p></li></ol><div><hr></div><h2>4. Lifecycle of a PySpark UDF Call</h2><h3>4.1 Registration &amp; Serialization of the Python Function</h3><pre><code>from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def uppercase(val):
    # Guard NULLs: Spark passes SQL NULL into Python as None
    return val.upper() if val is not None else None

uppercase_udf = udf(uppercase, StringType())
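
# Apply it like any column expression (df here is the DataFrame being
# transformed); every row now round-trips JVM -> Python -> JVM:
df = df.withColumn("name", uppercase_udf("name"))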
</code></pre><ul><li><p><code>_create_udf</code> wraps your Python function into a serializable form and tags it with return types.</p></li><li><p><strong>UDF object</strong> travels in the Spark plan to all executors.</p></li></ul><h3>4.2 Data Flow on Executors</h3><ol><li><p>Executor receives a task partition.</p></li><li><p>JVM serializes partition rows into Arrow or Pickle bytes.</p></li><li><p>Bytes stream over TCP to the Python worker.</p></li><li><p>Python worker deserializes, applies your function row-by-row.</p></li><li><p>Results are serialized back to JVM for further operators.</p></li></ol><h3>4.3 Detailed Serialization Cycle</h3><pre><code>JVM row object
  &#9492;&#9472;serialize&#9472;&#9654; Python bytes
      &#9492;&#9472;deserialize&#9472;&#9654; Python object
           &#9492;&#9472;apply UDF&#9472;&#9654; Python object
                &#9492;&#9472;serialize&#9472;&#9654; Python bytes
                     &#9492;&#9472;JVM bytes
                          &#9492;&#9472;deserialize&#9472;&#9654; JVM row</code></pre><p>Multiply that by every row, every partition, every stage &#8212; and you see why simple operations feel so sluggish.</p><div><hr></div><h2>5. Performance Implications</h2><h3>5.1 Quantifying the Overhead</h3><ul><li><p><strong>Catalyst loss:</strong> 10&#8211;30% longer query planning in UDF-heavy jobs.</p></li><li><p><strong>Serialization tax:</strong> 0.5&#8211;5 ms per row crossing (tested on medium-sized clusters).</p></li><li><p><strong>CPU utilization:</strong> &lt; 25% CPU usage across nodes despite heavy transforms.</p></li></ul><h3>5.2 Real-World Benchmark Example</h3><blockquote><p><strong>Scenario:</strong> Uppercasing a 100 million-row column.</p><ul><li><p><strong>Native Spark SQL:</strong></p></li></ul><pre><code>df.selectExpr("upper(name) as name")</code></pre><p>&#8594; 12 seconds end-to-end</p><ul><li><p><strong>Python UDF:</strong></p></li></ul><pre><code>df.withColumn("name", uppercase_udf("name"))</code></pre><p>&#8594; reorders, serialization, single-thread overhead &#8594; <strong>85 seconds</strong><br><em>7&#215; slower for a trivial transform.</em></p></blockquote><div><hr></div><h2>6. Strategies for Faster Custom Logic</h2><h3>6.1 Leverage Built-in Spark Functions</h3><p>Whenever possible, reach for Spark&#8217;s SQL functions (<code>upper</code>, <code>concat</code>, <code>regexp_replace</code>, etc.) &#8212; they run entirely in the JVM, enjoy whole-stage codegen, and scale across all cores.</p><h3>6.2 Pandas UDFs (Vectorized)</h3><p>Introduced in Spark 2.3, Pandas UDFs batch rows into <code>pandas.Series</code> and use <a href="https://arrow.apache.org/">Apache Arrow</a> for zero-copy transfer.</p><pre><code>from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType
import pandas as pd

@pandas_udf(StringType())
def upper_series(s: pd.Series) -&gt; pd.Series:
    return s.str.upper()

df.withColumn("name", upper_series("name"))
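
# For comparison (illustrative sketch, not from the original post): the
# row-at-a-time Python UDF version of the same transform -- the slow path
# benchmarked in section 5.2 -- with the explicit None guard recommended
# under "Common Pitfalls":
def upper_row(v):
    return v.upper() if v is not None else None

# from pyspark.sql.functions import udf
# upper_row_udf = udf(upper_row, StringType())
# df.withColumn("name", upper_row_udf("name"))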
</code></pre><ul><li><p><strong>Batch size:</strong> Controlled by <code>spark.sql.execution.arrow.maxRecordsPerBatch</code> (default 10,000 rows per call)</p></li><li><p><strong>Vectorized ops:</strong> Internal loops run in C via NumPy/pandas rather than the Python interpreter</p></li><li><p><strong>Results:</strong> 5&#8211;10&#215; speed-up over row-UDFs</p></li></ul><h3>6.3 Scala/Java UDFs</h3><p>If you need custom logic beyond SQL but want JVM speed:</p><ol><li><p><strong>Write a Scala or Java class</strong> implementing the <code>UDF1</code>/<code>UDF2</code>/&#8230; interface.</p></li><li><p><strong>Register it</strong> via <code>spark.udf.registerJavaFunction(...)</code>.</p></li><li><p><strong>Invoke</strong> it from PySpark as if it were a native function.</p></li></ol><ul><li><p><strong>No Python serialization</strong> needed.</p></li><li><p><strong>Runs inside the executor JVM</strong> with full multi-core utilization.</p></li></ul><h3>6.4 Threading &amp; Parallelism in Python UDFs</h3><p>If you absolutely must call an external API or library row-by-row:</p><ul><li><p><strong>Use multithreading</strong> inside your Python UDF to hide network latency.</p></li><li><p><strong>Batch HTTP calls</strong> where possible.</p></li><li><p><strong>Be cautious</strong>: GIL still applies for CPU-bound work, and thread pools can exhaust memory.</p></li></ul><div><hr></div><h2>7. Common Pitfalls &amp; Debugging Tips</h2><ul><li><p><strong>PicklingError:</strong> Ensure functions and closures reference only top-level functions and serializable objects.</p></li><li><p><strong>Null handling:</strong> Always guard inputs with <code>if v is None: return None</code>.</p></li><li><p><strong>Schema drift:</strong> Explicitly set return types; mismatches lead to confusing errors at shuffle boundaries.</p></li><li><p><strong>Memory leaks:</strong> Monitor Python worker logs for <code>MemoryError</code> and tune <code>spark.python.worker.memory</code>.</p></li></ul><div><hr></div><h2>8. 
Summary &amp; Best Practices</h2><ol><li><p><strong>Avoid plain Python UDFs</strong> whenever built-in Spark SQL functions suffice.</p></li><li><p><strong>Prefer Pandas UDFs</strong> for vectorized, batch transforms&#8212;they dramatically reduce boundary crossings via Apache Arrow, and ongoing Arrow improvements can make them competitive with, or even faster than, Scala/Java UDFs.</p></li><li><p><strong>Consider Scala/Java UDFs</strong> only when you need JVM-native logic that can&#8217;t be expressed in SQL or Pandas UDFs.</p></li><li><p><strong>Design for serializability</strong>: keep UDFs self-contained, stateless, and null-safe.</p></li><li><p><strong>Benchmark early</strong>: compare native vs. Pandas vs. Python vs. 
Scala/Java UDFs on representative data.</p></li><li><p><strong>Default to native functions first, then Pandas UDFs</strong>, in almost all cases.</p></li><li><p><strong>When you must call external APIs inside a UDF loop</strong>, embed threading or async parallelism to hide latency&#8212;see this <a href="https://www.youtube.com/watch?v=n9jodzYq1e4">video on parallelization within a loop</a> for an example.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!86fC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!86fC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png 424w, https://substackcdn.com/image/fetch/$s_!86fC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png 848w, https://substackcdn.com/image/fetch/$s_!86fC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png 1272w, https://substackcdn.com/image/fetch/$s_!86fC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!86fC!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png" width="1200" height="243.95604395604394" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:296,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:75946,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/162008971?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!86fC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png 424w, https://substackcdn.com/image/fetch/$s_!86fC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png 848w, https://substackcdn.com/image/fetch/$s_!86fC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png 1272w, 
https://substackcdn.com/image/fetch/$s_!86fC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>By understanding the multi-stage journey of data through the PySpark UDF pipeline &#8212; from JVM serialization, through Python&#8217;s single-threaded interpreter, back to the JVM &#8212; you can make informed choices that balance flexibility with performance. Next time you need custom logic, pause to ask: <em>&#8220;Can I batch or vectorize? &#8221;</em> Your cluster (and your users) will thank you.</p><p><a href="https://www.databricksters.com/p/everything-you-ever-wanted-to-know">To learn more about how to improve things, read our deep dive blog on Pandas UDF. </a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ybZ0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ybZ0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!ybZ0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!ybZ0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ybZ0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ybZ0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ybZ0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!ybZ0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!ybZ0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ybZ0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p> </p><h3>References</h3><ul><li><p>Ganesh, R. &#8220;Is really UDF hitting the performance in PySpark!&#8221; <em>Medium</em>, Jul 5, 2024. 
<a href="https://medium.com/%40rganesh0203/udf-is-hitting-the-performance-in-pysaprk-817b7e881dd2?utm_source=chatgpt.com">Medium</a></p></li><li><p></p><div id="youtube2-n9jodzYq1e4" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;n9jodzYq1e4&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/n9jodzYq1e4?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div></li><li><p>AWS Documentation. &#8220;Optimize user-defined functions,&#8221; <em>Tuning AWS Glue for Apache Spark</em> (AWS Prescriptive Guidance). <a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/tuning-aws-glue-for-apache-spark/optimize-user-defined-functions.html?utm_source=chatgpt.com">AWS Documentation</a></p></li><li><p>Tang, T. &#8220;Spark functions vs UDF performance?&#8221; <em>Stack Overflow</em>, Mar 5, 2018. <a href="https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance?utm_source=chatgpt.com">Stack Overflow</a></p></li><li><p>Databricks. &#8220;Arrow-optimized Python UDFs in Apache Spark&#8482; 3.5,&#8221; <em>Databricks Blog</em>, Aug 26, 2024. <a href="https://www.databricks.com/blog/arrow-optimized-python-udfs-apache-sparktm-35?utm_source=chatgpt.com">Databricks</a></p></li><li><p>&#8220;Why You Should Avoid Using UDFs in PySpark,&#8221; <em>Det.Life Blog</em>, Jan 2024. <a href="https://blog.det.life/why-you-should-avoid-using-udf-in-pyspark-c57558af9d0a?utm_source=chatgpt.com">Data Engineer Things</a></p></li><li><p>Illustrious_Ad4259. &#8220;Are there any major disadvantages in performance for Spark when using PySpark?&#8221; <em>Reddit r/dataengineering</em>, Nov 2021. 
<a href="https://www.reddit.com/r/dataengineering/comments/qning9/are_there_any_major_disadvantages_in_performance/?utm_source=chatgpt.com">Reddit</a></p></li><li><p>Sen, Soutir. &#8220;PySpark UDFs (User-Defined Functions) &#8211; Complete Guide,&#8221; <em>LinkedIn Article</em>, Dec 2024. <a href="https://www.linkedin.com/pulse/pyspark-udfs-user-defined-functions-complete-guide-soutir-sen-jkd6f?utm_source=chatgpt.com">linkedin.com</a></p></li><li><p>Two Sigma. &#8220;Introducing Pandas UDFs for PySpark,&#8221; <em>Two Sigma Article</em>.</p><p></p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.canadiandataguy.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading CanadianDataGuy&#8217;s No Fluff Newsletter! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Spark Join Strategies Explained: Broadcast Hash Join]]></title><description><![CDATA[Everything You Need to Know About Broadcast Hash Join]]></description><link>https://www.canadiandataguy.com/p/spark-join-strategies-explained-broadcast</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/spark-join-strategies-explained-broadcast</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Mon, 14 Apr 2025 14:00:00 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Apache Spark employs multiple join strategies to efficiently combine datasets in a distributed environment. This guide provides a <strong>zero-to-hero</strong> explanation of the three primary join strategies &#8211; <strong>Broadcast Hash Join (BHJ)</strong>, <strong>Shuffle Hash Join (SHJ)</strong>, and <strong>Sort-Merge Join (SMJ)</strong> &#8211; with a focus on Databricks. We will explore how each strategy works, their execution plans (DAG stages, partitioning, memory and shuffle behavior), and how to tune these joins on Databricks (including relevant configurations like AQE and join hints). A visual cheat sheet and further reading resources are provided at the end.</p><h2>Introduction to Spark Join Strategies</h2><p>In Spark SQL, a <em>join</em> combines two datasets by matching rows on a common key. The way Spark executes the join greatly impacts performance, especially with large data. Spark&#8217;s Catalyst optimizer will choose a join strategy based on data statistics (size of each side, join type, etc.), or you can influence it via hints and settings. 
The three main join strategies for equi-joins are:</p><ul><li><p><strong>Broadcast Hash Join (BHJ)</strong> &#8211; Broadcasts the entire smaller dataset to all executors, avoiding shuffles for that side. Very fast when one side is sufficiently small; analogous to a map-side join in Hadoop.</p></li><li><p><strong>Shuffle Hash Join (SHJ)</strong> &#8211; Shuffles both datasets on the join key, then builds a hash table on the smaller side of each partition and streams the larger side to find matches.</p><p>Avoids the sort step of SMJ but requires enough memory per partition.</p></li><li><p><strong>Sort-Merge Join (SMJ)</strong> &#8211; Shuffles both datasets on the join key and sorts them, then merges sorted partitions to find matches. This is Spark&#8217;s default strategy for large data and supports all join types. It&#8217;s robust (can spill to disk if needed) but involves heavy network and CPU overhead for sorting.</p></li></ul><p>Each strategy has optimal use cases and pitfalls. In Databricks (which uses Spark under the hood), adaptive query execution (AQE) can dynamically optimize joins (e.g. switching strategies or handling skew) to improve performance. We&#8217;ll now dive into each strategy in detail.</p><h2>What is a Broadcast Hash Join (BHJ)?</h2><p>A <strong>Broadcast Hash Join</strong> is an efficient strategy used to join two datasets in Spark when one of them is significantly smaller than the other. Instead of moving data across the network (shuffling) for both sides of the join, Spark copies&#8212;or "broadcasts"&#8212;the entire small dataset to every worker node (executor). Then, each executor performs a local hash join between its partition of the larger dataset and the entire, locally cached, small dataset. 
This approach helps to avoid expensive network shuffling and the need for sorting on either side of the join.</p><div><hr></div><h3>The Broadcast Process in Detail</h3><p>The broadcast procedure involves:</p><ol><li><p><strong>Collecting the Data:</strong><br>The driver first gathers the entire small dataset and converts it into an efficient in-memory data structure (typically a hash map).</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2dL-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2dL-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!2dL-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!2dL-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!2dL-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2dL-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png" 
width="696" height="696" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/796ba06f-0553-4312-834e-611a5f5615af_1024x1024.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:696,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!2dL-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!2dL-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!2dL-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!2dL-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p><strong>Distributing the Data:</strong><br>This hash map is then distributed (broadcast) to all executor nodes, usually via a network distribution algorithm akin to torrent distribution.</p></li><li><p><strong>Utilizing the Broadcast Data:</strong><br>Each executor then uses the broadcasted data to quickly look up matching join keys when processing its partition of the larger dataset.</p></li></ol><p>Understanding these steps is crucial because if any stage fails&#8212;whether due to memory limits on the driver, executor constraints, or even network issues&#8212;the entire query may fail.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!LLu1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LLu1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png 424w, https://substackcdn.com/image/fetch/$s_!LLu1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png 848w, https://substackcdn.com/image/fetch/$s_!LLu1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png 1272w, https://substackcdn.com/image/fetch/$s_!LLu1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LLu1!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png" width="1200" height="948.6263736263736" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aaab4992-f0b0-4132-8663-112361f4f830_2432x1922.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1151,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:549678,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/160914089?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaab4992-f0b0-4132-8663-112361f4f830_2432x1922.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LLu1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png 424w, https://substackcdn.com/image/fetch/$s_!LLu1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png 848w, https://substackcdn.com/image/fetch/$s_!LLu1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png 1272w, https://substackcdn.com/image/fetch/$s_!LLu1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>When Does Spark Use BHJ?</h3><p>Spark will automatically choose to perform a Broadcast Hash Join under these conditions:</p><ul><li><p><strong>Dataset Size:</strong> One side of the join is smaller than a pre-configured threshold, which is by default 10 MB in open-source Spark. 
In Databricks environments, this threshold is commonly increased (e.g., ~30 MB with adaptive execution), meaning Databricks can handle moderately larger tables.</p></li><li><p><strong>Join Type:</strong> The join condition is an equality condition (equi-join).</p></li></ul><p>The setting <code>spark.sql.autoBroadcastJoinThreshold</code> controls this threshold and can be adjusted based on available memory and expected performance benefits.</p><p>BHJ works well with these join types:</p><ul><li><p><strong>Supported:</strong> Inner joins, and left, semi, or anti joins (as long as the correct side is broadcast).</p></li><li><p><strong>Limitations:</strong> It is not supported for full outer joins. For right outer joins, only the left table can be broadcast; similarly, in left joins only the right table can be broadcast.</p></li></ul><p>If the join type is not supported by a BHJ, Spark may revert to another join strategy, such as a sort-merge join or a broadcast nested loop join when dealing with non-equi conditions.</p><div><hr></div><h4>Databricks and Adaptive Query Execution (AQE)</h4><p>In Databricks:</p><ul><li><p><strong>Adaptive Query Execution (AQE):</strong> AQE can dynamically convert a sort-merge join into a broadcast hash join if it determines at runtime that one side of the join is smaller than the broadcast threshold.</p></li><li><p><strong>Higher Thresholds:</strong> Databricks&#8217; default setting for auto-broadcast (often <code>spark.databricks.adaptive.autoBroadcastJoinThreshold</code>) may be set higher (e.g., 30 MB) to allow for broadcasting moderately larger tables.</p></li><li><p><strong>Forcing Broadcasts:</strong> Although AQE works automatically, you might sometimes use explicit hints (such as <code>/*+ BROADCAST(table) */</code> in SQL or wrapping a DataFrame with <code>broadcast(df)</code> in PySpark) to ensure the small dataset is broadcast immediately, thereby skipping unnecessary shuffles.</p><p></p></li></ul><div><hr></div><h2>Common 
Misconception: Order of Joins</h2><p>For optimal join-order performance, perform joins from smallest to largest tables first to minimize data shuffling. <strong>However, do broadcast joins last</strong>, even though this seems counterintuitive. This is because:</p><ul><li><p>Broadcast joins don't require shuffles and can be executed efficiently even on large fact tables</p></li><li><p>If broadcast joins are done first, the joined data needs to be shuffled again for later joins</p></li><li><p>By doing broadcast joins last, we avoid having to shuffle that data again.</p></li><li><p>Group together joins that share the same ON clause to reduce shuffling, since the data is already arranged properly</p></li></ul><h3>Memory and Shuffle Considerations</h3><p>Using BHJ provides tremendous speedups by eliminating the costly shuffle of the larger dataset. However, it comes with some significant memory considerations:</p><ul><li><p><strong>Driver Memory:</strong> The whole small dataset must be collected on the driver before it can be broadcast. The driver has a memory limit, defined by <code>spark.driver.maxResultSize</code>, and exceeding this limit will cause the job to fail.</p></li><li><p><strong>Executor Memory:</strong> Each executor must have enough memory to store the broadcasted dataset along with its own processing workload. The available memory on the node with the smallest capacity is the practical limit.</p></li><li><p><strong>Timeout and Overload Risks:</strong> If the dataset is even moderately large, broadcasting it might overwhelm the driver or network, leading to out-of-memory (OOM) errors or timeouts.
For example, while Databricks has even seen broadcasts for datasets up to a few GB in size, one must exercise extreme caution when attempting such operations.</p></li><li><p><strong>Compression Differences:</strong> Note that the on-disk size of data (like Parquet files in Delta tables) might be much smaller than the in-memory representation. Spark&#8217;s decisions are based on disk size, so actual in-memory data after decompression might far exceed the expected limits.</p></li></ul><p>To address these issues, you can either disable auto-broadcast by setting <code>spark.sql.autoBroadcastJoinThreshold</code> to -1 or lower the threshold to ensure no large table is inadvertently broadcasted. On Databricks with the Photon engine, <strong>executor-side broadcasts</strong> further alleviate pressure on the driver because the broadcast process does not rely solely on the driver's resources.</p><div><hr></div><h3>Performance Recommendations</h3><ul><li><p><strong>When to Use BHJ:</strong><br>Use Broadcast Hash Join when one dataset is much smaller than the other. This is commonly the case when joining large fact tables with much smaller dimension tables or when one table is the result of a selective filter.</p></li><li><p><strong>Why Forcing Broadcasts:</strong><br>While Spark&#8217;s optimizer may choose to broadcast small datasets automatically, in complex queries or skewed datasets the statistics might not be accurate. In those cases, manually forcing a broadcast using explicit hints ensures that the join operation skips the shuffle stage and executes as a broadcast join.</p></li><li><p><strong>Caution in Production:</strong><br>Forcing broadcasts in ad hoc queries or development is acceptable. However, in production workloads, it&#8217;s important to validate the dataset size at runtime. This can be done by checking record counts and partition sizes to avoid overloading any executor or the driver. 
Monitoring the Spark UI is critical to ensure broadcasts do not result in GC (garbage collection) pressure or other resource issues.</p></li></ul><div><hr></div><h3>Example SQL with Broadcast Hint</h3><p>To explicitly force a broadcast in SQL, you can include the following hint in your query:</p><pre><code>SELECT /*+ BROADCASTJOIN(table1)*/ table1.id, table1.col, table2.id, table2.int_col FROM table1 JOIN table2 ON table1.id = table2.id;</code></pre><p>In the physical plan, you will see a <code>BroadcastExchange</code> operator for the small table along with a <code>BroadcastHashJoin</code> operator, indicating that the join was executed without additional shuffling of the large table.</p><pre><code>SQL Query : 
select /*+ BROADCASTJOIN(table1)*/ table1.id,table1.col,table2.id,table2.int_col from table1 join table2 on table1.id = table2.id

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [id#271L], [id#286L], Inner, BuildLeft, false
   :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#955]
   :  +- Filter isnotnull(id#271L)
   :     +- Scan ExistingRDD[id#271L,col#272]
   +- Filter isnotnull(id#286L)
      +- Scan ExistingRDD[id#286L,int_col#287L]

Number of records processed: 799541
Querytime : 15.35717314 seconds</code></pre><div><hr></div><h3>Key Pitfalls and Best Practices</h3><ul><li><p><strong>Avoid Broadcasting Too Much Data:</strong><br>Never broadcast a table that is too large (generally over 1GB) as it can overwhelm the driver and executors. Spark has a hard limit (roughly 8GB) on what it can broadcast.</p></li><li><p><strong>Watch for Non-Equi Joins:</strong><br>BHJ only supports joins using equality conditions (equi-joins). When using non-equi join conditions (such as range conditions), BHJ cannot be applied.</p></li><li><p><strong>Force with Caution:</strong><br>When you force a broadcast using hints or functions like <code>broadcast(df)</code>, you bypass Spark&#8217;s adaptive query execution optimizations. This is useful if you are sure the data size is small, but can cause performance issues if the dataset unexpectedly grows.</p></li><li><p><strong>Plan for Memory Needs:</strong><br>Increase the broadcast thresholds only if your driver and executors have ample memory. For instance, a driver with 32GB+ memory might safely use higher thresholds (like 200MB). Be sure to also configure <code>spark.driver.maxResultSize</code> appropriately to avoid driver-level memory errors.</p></li></ul><div><hr></div><h2>Production Advice</h2><p>When deploying BHJ in production workloads, careful planning and ongoing monitoring are essential to ensure stable performance:</p><ul><li><p><strong>Validate Data Sizes:</strong> Always verify that the dataset chosen for broadcasting is truly small both on disk and in-memory. Measure the record count and partition sizes before forcing a broadcast. This helps prevent unexpected OOM (out-of-memory) failures, which can occur when the dataset size exceeds available memory on the driver or executors.</p></li><li><p><strong>Check Data Size and Record Count</strong></p><ul><li><p><strong>Count the Records:</strong> Before attempting a broadcast, run a simple <code>df.count()</code> on the small dataset. 
This confirms that the number of records is within an acceptable range.</p></li><li><p><strong>Estimate Data Size in Memory:</strong> Sometimes the dataset's on-disk size differs from its in-memory footprint. You can either use approximations from your data source&#8217;s statistics or compute a rough estimate using:</p></li></ul><pre><code># Example in PySpark
data_size_in_bytes = df.rdd.map(lambda row: len(str(row))).sum()
print("Approximate in-memory size (bytes):", data_size_in_bytes)</code></pre><ul><li><p>While this isn&#8217;t exact, it provides an estimate that can be compared against thresholds like <code>spark.sql.autoBroadcastJoinThreshold</code>.</p></li></ul></li><li><p><strong>Threshold Validation before Forcing a Broadcast</strong></p><ul><li><p><strong>Compare Against Broadcast Thresholds:</strong> Before performing an explicit broadcast, validate that the data size is below the configured threshold (e.g., 10MB, 30MB, or a custom value in your Spark configuration). This might involve:</p></li></ul><pre><code># The config value may be returned with a trailing "b" (bytes), e.g. "10485760b"
broadcast_threshold = int(spark.conf.get("spark.sql.autoBroadcastJoinThreshold").rstrip("b"))
# Assume approximate_size holds our computed or estimated size of the dataset in bytes.
if approximate_size &lt; broadcast_threshold:
    print("Proceed with broadcast")
    # Then apply an explicit broadcast hint
    from pyspark.sql.functions import broadcast
    df_broadcasted = broadcast(df)
else:
    print("Data too large; do not broadcast")
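
# A hypothetical wrapper combining the guard above; the function name is
# illustrative, not a standard API:
def broadcast_if_small(df, approximate_size, threshold):
    # Apply a broadcast hint only when the estimated size is positive and
    # safely under the threshold; otherwise let Spark pick the join strategy.
    if threshold > approximate_size > 0:
        from pyspark.sql.functions import broadcast
        return broadcast(df)
    return df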
</code></pre><ul><li><p>This validation helps avoid unintentionally broadcasting a dataset that is too big, potentially causing an OOM error.</p></li></ul></li><li><p><strong>Monitor Resource Usage:</strong> Leverage Spark&#8217;s UI and logging mechanisms to track metrics like GC (garbage collection) activity, memory usage, and broadcast sizes. The smallest available executor memory sets the limit, so ensure that the broadcast data comfortably fits on each node.</p></li><li><p><strong>Use Adaptive Query Execution (AQE) Carefully:</strong> While Spark&#8217;s AQE can convert joins to BHJ at runtime, explicitly broadcasting small datasets using hints or functions like <code>broadcast(df)</code> can bypass the overhead of shuffling. However, avoid hardcoding broadcast hints unless you are confident of the dataset's size, as data volumes may fluctuate in production workloads.</p></li><li><p><strong>Configure Thresholds Cautiously:</strong> Adjust configurations such as <code>spark.sql.autoBroadcastJoinThreshold</code> (and related thresholds in environments like Databricks) based on current cluster resources. For drivers with high memory (32GB+), thresholds can be increased, but setting these too high risks overwhelming your system if data volumes grow unexpectedly.</p></li><li><p><strong>Plan for Scalability and Edge Cases:</strong> Implement safeguards within your production pipelines. For instance, include runtime validations or logic to disable broadcasting dynamically when data sizes approach critical limits. This is especially important for pipelines handling dynamic or streaming data where bursts of data could otherwise lead to system instability.</p></li><li><p>If you&#8217;re running a driver with a lot of memory (32GB+), you can safely raise the broadcast thresholds to something like <strong>200MB</strong></p></li></ul><pre><code><code>set spark.sql.autoBroadcastJoinThreshold = 209715200;
set spark.databricks.adaptive.autoBroadcastJoinThreshold = 209715200;</code></code></pre><ul><li><p><strong>Why do we need to explicitly broadcast smaller tables if AQE can automatically broadcast smaller tables for us?</strong> The reason is that AQE optimizes queries while they are being executed.</p><ul><li><p>Spark needs to shuffle the data on both sides first; only then can AQE alter the physical plan based on the shuffle-stage statistics and convert the join to a broadcast join</p></li><li><p>Therefore, if you explicitly broadcast smaller tables using hints, the shuffle is skipped altogether and your job will not need to wait for AQE&#8217;s intervention to optimize the plan</p></li></ul></li><li><p><strong>Never broadcast a table bigger than 1GB</strong>, because the broadcast happens via the driver, and a 1GB+ table will either cause OOM on the driver or make the driver unresponsive due to large GC pauses</p></li><li><p>Please note that the size of a table on disk and in memory will never be the same. Delta tables are backed by Parquet files, which can have varying levels of compression depending on the data. Spark might broadcast them based on their on-disk size; however, they might actually be really big (even more than 8GB) in memory after decompression and conversion from column to row format. Spark has a hard limit of 8GB on the table size it can broadcast, so your job may fail with an exception in this circumstance.
In this case, the solution is to either disable broadcasting by setting <code>spark.sql.autoBroadcastJoinThreshold</code> to -1 and explicitly broadcast (via hints or the PySpark broadcast function) only the tables that are really small on disk as well as in memory, or set <code>spark.sql.autoBroadcastJoinThreshold</code> to a smaller value like 100MB or 50MB instead of setting the threshold to -1.</p></li><li><p>By default, the driver can only collect up to 1GB of data in memory at any given time, and anything more than that triggers an error in the driver, causing the job to fail. However, since we want to broadcast tables larger than 10MB, we risk running into this problem. It can be solved by increasing the value of the following driver <a href="https://spark.apache.org/docs/latest/configuration.html#application-properties">configuration</a>.</p><ul><li><p>Please keep in mind that because this is a driver setting, it cannot be altered once the cluster is launched. Therefore, it should be set under the cluster&#8217;s advanced options as a Spark config. Setting this parameter to 8GB for a driver with &gt;32GB memory works fine in most circumstances. In certain cases where the broadcast hash join is going to broadcast a very large table, setting this value to 16GB would also make sense.</p></li><li><p>In <a href="https://learn.microsoft.com/en-us/azure/databricks/runtime/photon">Photon</a>, we have executor-side broadcast, so you don&#8217;t have to change the following driver configuration if you use a Databricks Runtime (DBR) with Photon.</p></li></ul></li></ul><pre><code><code>spark.driver.maxResultSize 16g</code></code></pre><div><hr></div><h3>Final Thoughts</h3><p>In summary, Broadcast Hash Join is a fast and efficient join strategy in Spark for unbalanced joins where one dataset is significantly smaller than the other.
It avoids the expensive shuffling of the larger dataset by replicating the small data across all executors, enabling quick local hash lookups. However, its effectiveness depends heavily on the small dataset fitting in memory on the driver and executors. Forcing broadcasts should be done judiciously, with thorough validations in production to prevent resource exhaustion and associated failures.</p><p>By understanding the details of how BHJ operates and its configurations, you can better optimize your Spark jobs and manage performance, especially in environments like Databricks where adaptive query execution and executor-side optimizations further enhance its capabilities.</p><h4>How the Process Works</h4><p>BHJ operates in <strong>two main phases</strong>:</p><ol><li><p><strong>Broadcast Phase:</strong></p><ul><li><p><strong>Collection and Broadcast:</strong> The small table is first collected by the Spark driver. After collection, the data is broadcast to all the executors across the cluster.</p></li><li><p><strong>Local Caching:</strong> Once received on each node, the small dataset is cached in memory as a read-only broadcast variable. This ensures that the data is immediately available for the join process without any further data movement.</p></li></ul></li><li><p><strong>Hash Join Phase:</strong></p><ul><li><p><strong>Building a Hash Map:</strong> Each executor creates an in-memory hash map from the broadcasted dataset. The hash map is built using the join key.</p></li><li><p><strong>Local Join Operation:</strong> As the larger dataset is processed, every row in each partition is checked against the hash map for matching join keys. 
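To make the build-and-probe idea concrete, here is a toy sketch in plain Python. The rows, keys, and variable names are invented for illustration; Spark's real implementation performs this per partition on its own internal hashed relations.

```python
# Toy illustration of Broadcast Hash Join's two phases.
small = [(1, "CA"), (2, "US")]                      # "broadcast" side (dimension)
large = [(1, 100), (2, 200), (1, 300), (3, 400)]    # streamed side (fact)

# Phase 1 -- broadcast: every executor receives `small` and builds a hash map
# keyed on the join column.
hash_map = {}
for key, value in small:
    hash_map.setdefault(key, []).append(value)

# Phase 2 -- probe: each fact row does a local O(1) lookup; no shuffle, no sort.
joined = [(key, measure, dim)
          for key, measure in large
          for dim in hash_map.get(key, [])]

print(joined)  # inner-join semantics: key 3 has no match and is dropped
```

Spark performs this same lookup independently inside every partition of the large table, which is why the large side never has to move across the network.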
Because the small dataset is already available locally, this lookup is very fast and eliminates the need for shuffling data across the network.</p></li></ul></li></ol><p>Since no sort or extra merge steps are required, this one-pass in-memory lookup per partition makes the Broadcast Hash Join particularly quick, especially in common scenarios like joining large fact tables with much smaller dimension tables (a typical star schema pattern).</p><h2>Further Reading</h2><p>For more in-depth information and the latest updates on Spark join optimizations, the following resources are highly recommended:</p><ul><li><p><strong>Apache Spark Official Documentation &#8211; SQL Performance Tuning:</strong> Covers join strategy hints, adaptive execution, etc. (See <em>&#8220;Join Strategy Hints&#8221;</em> and <em>&#8220;Adaptive Query Execution&#8221;</em> in the Spark docs)&#8203;</p><p><a href="https://downloads.apache.org/spark/docs/3.1.2/sql-performance-tuning.html#:~:text=Join%20Strategy%20Hints%20for%20SQL,Queries">downloads.apache.org</a></p></li><li><p><a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/spark-tuning-glue-emr/using-join-hints-in-spark-sql.html#:~:text=,all%20executors%20within%20the%20cluster">Tuning Spark SQL queries for AWS Glue and Amazon EMR Spark jobs</a></p></li><li><p><strong>Apache Spark Official Documentation &#8211; Adaptive Query Execution (AQE):</strong> Detailed explanation of AQE features like converting SMJ to BHJ/SHJ and skew join handling&#8203;</p><p><a href="https://downloads.apache.org/spark/docs/3.1.2/sql-performance-tuning.html#:~:text=Converting%20sort,join">downloads.apache.org</a></p></li><li><p><strong>Databricks Documentation &#8211; Join Hints &amp; Optimizations:</strong> Databricks-specific docs on join strategies, including the <code>SKEW</code> and <code>RANGE</code> hints, and how AQE is used on Databricks&#8203;</p><p><a 
href="https://docs.databricks.com/gcp/en/transform/join#:~:text=Join%20hints%20on%20Databricks">docs.databricks.com</a></p></li><li><p><strong>&#8220;How Databricks Optimizes Spark SQL Joins&#8221; &#8211; Medium (dezimaldata):</strong> A blog post (Aug 2023) summarizing Databricks&#8217; techniques like CBO, AQE, range join and skew join optimizations&#8203;</p><p><a href="https://dezimaldata.medium.com/how-databricks-optimizes-the-spark-sql-joins-d01b95336d65#:~:text=In%20this%20blog%20post%2C%20we,joins%20and%20subqueries%2C%20such%20as">dezimaldata.medium.com</a></p></li><li><p><strong>&#8220;Top 5 Mistakes That Make Your Databricks Queries Slow&#8221; &#8211; Perficient Blog:</strong> Section 1 and 2 discuss data skew and suboptimal join strategies, with tips on salting and broadcast joins&#8203;</p><p><a href="https://blogs.perficient.com/2025/03/28/top-5-mistakes-that-make-your-databricks-queries-slow-and-how-to-fix-them/#:~:text=1">blogs.perficient.com</a></p></li><li><p><strong>Spark Summit Talks on Joins and AQE:</strong> Videos like <em>&#8220;Optimizing Shuffle Heavy Workloads&#8221;</em> or <em>&#8220;AQE in Spark 3.0&#8221;</em> (by Databricks engineers) for a deeper understanding of the internals of join execution and tuning.</p></li><li><p><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html#join-strategy-hints">https://spark.apache.org/docs/latest/sql-performance-tuning.html#join-strategy-hints</a></p></li><li><p><a 
href="https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution">https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution</a></p></li><li><p>https://docs.databricks.com/en/sql/language-manual/hints.html</p></li><li><p>https://medium.com/@dezimaldata/how-databricks-optimizes-spark-sql-joins-aqe-cbo-and-more-5ac4c4d53091</p></li><li><p>https://www.perficient.com/insights/blog/2023/01/top-5-mistakes-that-make-your-databricks-queries-slow</p></li><li><p>https://www.databricks.com/resources/whitepapers/optimizing-apache-spark-on-databricks</p></li></ul><p>By consulting these materials, you can deepen your understanding of Spark join mechanisms and keep up to date with the evolving best practices on the Databricks platform.</p>]]></content:encoded></item><item><title><![CDATA[Spark Join Strategies Explained: Shuffle Hash]]></title><description><![CDATA[Everything You Need to Know About Shuffle Hash Join]]></description><link>https://www.canadiandataguy.com/p/spark-join-strategies-explained-shuffle</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/spark-join-strategies-explained-shuffle</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Thu, 10 Apr 2025 14:00:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!4QvA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>1. Introduction</h2><p>Modern big data applications often require joining huge datasets efficiently. Choosing the right join strategy is critical to optimize performance and resource usage. Apache Spark offers several join methods, including broadcast joins, sort-merge joins, and shuffle hash joins. 
Shuffle Hash Join (SHJ) stands out as a middle-ground approach:</p><ul><li><p>It <strong>shuffles</strong> both tables like sort-merge joins to align data with the same key.</p></li><li><p>Instead of sorting, it builds an <strong>in-memory hash table</strong> for the smaller dataset per partition and probes it with rows from the larger dataset.</p></li></ul><p>This dual approach has the potential to improve execution time by reducing the sorting overhead but demands careful memory management.</p><div><hr></div><h2>2. Understanding Shuffle Hash Join</h2><p><strong>Shuffle Hash Join</strong> is best understood as a hybrid that borrows elements from two traditional join methods:</p><ul><li><p><strong>Sort Merge Join (SMJ)</strong></p><ul><li><p><strong>Mechanism:</strong> Both datasets are sorted by the join key and then merged.</p></li><li><p><strong>Pros:</strong> Reliable for large datasets.</p></li><li><p><strong>Cons:</strong> Sorting is CPU intensive.</p></li></ul></li><li><p><strong>Broadcast Hash Join (BHJ)</strong></p><ul><li><p><strong>Mechanism:</strong> The smaller table is broadcast to all nodes, and each executor performs a local hash join.</p></li><li><p><strong>Pros:</strong> Eliminates shuffling.</p></li><li><p><strong>Cons:</strong> Limited by broadcast size, not suitable when the smaller table exceeds available memory on executors.</p></li></ul></li></ul><p><strong>How SHJ Differentiates Itself:</strong></p><ul><li><p><strong>Key Step:</strong> It shuffles both datasets based on the join key so that every partition contains matching keys.</p></li><li><p><strong>In-Partition Operation:</strong> Instead of sorting the data in each partition, Spark builds a hash table from the smaller dataset's partition and then probes that table with each row from the larger dataset.</p></li><li><p><strong>Memory Sensitivity:</strong> The approach assumes that each partition of the smaller side can be held in memory, which is crucial for performance and avoiding runtime
errors.</p></li></ul><blockquote><p><strong>Key Concepts to Remember:</strong></p><ul><li><p><strong>No Sorting:</strong> Eliminates the costly sort phase.</p></li><li><p><strong>Memory Requirement:</strong> High dependency on the ability to fit the hashed partition in memory, risking OOM errors if miscalculated.</p></li></ul></blockquote><div><hr></div><h2>3. When to Use SHJ</h2><h3>Historical Perspective</h3><ul><li><p><strong>Pre-Spark 3.0:</strong><br>Spark defaulted to Sort Merge Join for equality-based joins due to the risk of OOM when building in-memory hash tables.</p></li><li><p><strong>Spark 3.x and Beyond:</strong><br>With enhancements like Adaptive Query Execution (AQE), Spark can dynamically decide to use SHJ when it detects that:</p><ul><li><p>The smaller dataset, after partitioning, is of manageable size.</p></li><li><p>Avoiding the expensive sorting operation is beneficial for performance.</p></li></ul></li></ul><h3>Practical Scenarios</h3><ul><li><p><strong>Moderately Small Datasets:</strong><br>When one dataset is small enough that its partitions are lightweight (e.g., 5 MB per partition out of 5 GB divided across 1000 partitions), yet not small enough for a broadcast join.</p></li><li><p><strong>High Sorting Overhead:</strong><br>When joining a massive fact table (e.g., 1 TB) with a dimension table that is too big to broadcast but small enough per partition, the cost of sorting the entire dataset (as in SMJ) may dominate and thus SHJ becomes more efficient.</p></li></ul><h3>Decision Factors</h3><ul><li><p><strong>Estimated Partition Size:</strong><br>Spark&#8217;s optimizer checks if the estimated per-partition size of the smaller table is below a threshold (set via <code>spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold</code>).</p></li><li><p><strong>Configuration and Hints:</strong><br>Users can guide Spark&#8217;s optimizer using hints like <code>/*+ SHUFFLE_HASH(tab) */</code> or disable sort-merge joins by toggling 
<code>spark.sql.join.preferSortMergeJoin</code>:</p><pre><code>spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")</code></pre></li></ul><div><hr></div><h2>4. How SHJ Works</h2><p>The execution of a Shuffle Hash Join can be understood through two primary phases, with some literature breaking it into a three-phase model for clarity.</p><h3>A. Shuffle Phase</h3><p><strong>Objective:</strong><br>Bring together all rows associated with a given join key within the same partition.</p><p><strong>Process:</strong></p><ul><li><p><strong>Repartitioning:</strong><br>Both datasets are re-distributed (shuffled) using the join key as the partitioning key. Note that <strong>both</strong> sides are shuffled &#8211; so network cost is still incurred for both datasets.</p></li><li><p><strong>Data Co-location:</strong><br>Post-shuffle, each partition will hold all the relevant rows for a specific range of join keys.</p></li><li><p><strong>Network I/O:</strong><br>While shuffling ensures correct join semantics, it incurs the cost of network communication for both datasets.</p></li></ul><p><strong>Example Scenario:</strong></p><p>Imagine two datasets, <code>Person</code> and <code>Address</code>, initially spread across different partitions. In the shuffle phase, rows with the same key (e.g., <code>A001</code>) are sent to the same partition. This guarantees that later join operations will have all matching keys available on the same executor.</p><h3>B.
Hash Join Phase</h3><p>After the shuffle phase, the join is executed within each partition through these steps:</p><ol><li><p><strong>Hash Table Creation:</strong></p><ul><li><p><strong>Selection:</strong><br>Spark selects the smaller dataset based on statistics or join hints.</p></li><li><p><strong>Building the Hash Table:</strong><br>For every partition, Spark creates an in-memory hash table that maps join keys to the associated rows.</p></li></ul></li><li><p><strong>Probing the Hash Table:</strong></p><ul><li><p><strong>Streaming Data:</strong><br>The larger dataset&#8217;s rows are processed sequentially within the partition.</p></li><li><p><strong>Lookup and Join:</strong><br>For each row in the larger dataset, the hash table is queried using the join key. If a match exists, Spark produces the joined row as output.</p></li></ul></li></ol><blockquote><p><em>Because no sort is done, if the data per partition is large, the hash table may also be large. Spark assumes the build side will fit in memory. If it doesn&#8217;t, the task can spill partitions of the build side to disk (Spark has some support for spilling hash tables, but it is more complex than spilling a sort). In worst cases, an SHJ can run out of memory if the hash table grows too big, causing the executor to OOM. 
This is why Spark is conservative in using SHJ unless it&#8217;s confident the partitions are small enough&#8203;</em></p></blockquote><p><strong>Conceptual Diagram:</strong></p><p>Imagine a partition where:</p><ul><li><p>The smaller dataset&#8217;s partition (say, 5 MB worth of data) is fully loaded into a hash table.</p></li><li><p>The larger dataset streams through, and for each key, Spark quickly checks the in-memory hash table for corresponding rows.</p></li></ul><p>This operation is performed concurrently across all partitions on different worker nodes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4QvA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4QvA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png 424w, https://substackcdn.com/image/fetch/$s_!4QvA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png 848w, https://substackcdn.com/image/fetch/$s_!4QvA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png 1272w, https://substackcdn.com/image/fetch/$s_!4QvA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!4QvA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png" width="744" height="918" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:918,&quot;width&quot;:744,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Phases of the Shuffle Hash join: Scan JSON read data, exchange, ShuffleHashJoin, and Hash Aggregate&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Phases of the Shuffle Hash join: Scan JSON read data, exchange, ShuffleHashJoin, and Hash Aggregate" title="Phases of the Shuffle Hash join: Scan JSON read data, exchange, ShuffleHashJoin, and Hash Aggregate" srcset="https://substackcdn.com/image/fetch/$s_!4QvA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png 424w, https://substackcdn.com/image/fetch/$s_!4QvA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png 848w, https://substackcdn.com/image/fetch/$s_!4QvA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4QvA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"></div></div></div></a></figure></div><h3>Alternative Three-Phase View</h3><p>For some, a detailed three-phase breakdown clarifies the process:</p><ol><li><p><strong>Shuffle:</strong><br>Repartition both datasets so that all rows sharing the same join key are co-located.</p></li><li><p><strong>Hash Table Creation:</strong><br>For each partition, build the in-memory hash table using the smaller
dataset.</p></li><li><p><strong>Hash Join:</strong><br>Join the larger dataset&#8217;s partition by probing the hash table.</p></li></ol><p>This view underlines the importance of parallel execution, where each worker node processes its partitions independently, which is key to Spark&#8217;s scalability.</p><div><hr></div><h2>5. Supported Join Types</h2><p>Shuffle Hash Join is designed to work primarily with <strong>equi-joins</strong>. In Apache Spark, it supports:</p><ul><li><p><strong>Inner Joins:</strong><br>Only matching rows are returned.</p></li><li><p><strong>Left, Right, Semi, and Anti Joins:</strong><br>These join types function well as long as the join condition is based on equality.</p></li></ul><p><strong>Additional Notes:</strong></p><ul><li><p><strong>Full Outer Join:</strong><br>SHJ did not support full outer joins in Spark 3.0; support was added in Spark 3.1+.</p></li><li><p><strong>Non-equi Joins and Cross Joins:</strong><br>SHJ does not naturally handle cross joins or non-equi conditions. In such cases, Spark falls back on other, more suitable join strategies.</p></li></ul><div><hr></div><h2>6. 
Performance Characteristics &amp; Trade-Offs</h2><p>Understanding the performance implications of SHJ is critical for designing robust, high-performance Spark jobs.</p><h3>Advantages</h3><ul><li><p><strong>No Sorting Required:</strong></p><ul><li><p>By eliminating the sort step used in SMJ, SHJ significantly reduces CPU overhead.</p></li></ul></li><li><p><strong>Efficient CPU Usage:</strong></p><ul><li><p>Hash functions and probing operations are generally less costly than sorting large datasets.</p></li></ul></li><li><p><strong>Parallel Execution:</strong></p><ul><li><p>The join is processed in parallel across partitions, making it scalable across large clusters.</p></li></ul></li></ul><h3>Considerations and Pitfalls</h3><ul><li><p><strong>Memory Sensitivity:</strong></p><ul><li><p><strong>Build Side Dependency:</strong><br>Every partition on the smaller side must fit in memory. If a partition exceeds available memory, it may cause disk spills or even OOM errors.</p></li><li><p><strong>Configuration Challenges:</strong><br>Incorrect estimations or misconfigured thresholds can lead to failures. Monitoring and adjusting Spark&#8217;s parameters is essential.</p></li></ul></li><li><p><strong>Data Skew:</strong></p><ul><li><p><strong>Uneven Distribution:</strong><br>A heavily skewed join key might result in one partition holding a disproportionate amount of data, dramatically increasing memory requirements for that partition.</p></li><li><p><strong>Mitigation Strategies:</strong><br>Use techniques like increasing the number of shuffle partitions (via <code>spark.sql.shuffle.partitions</code>) or applying custom salting techniques.</p></li></ul></li><li><p><strong>Network I/O:</strong></p><ul><li><p>While SHJ saves on CPU cycles, it does not reduce the network cost of shuffling. 
If your workload is network-bound, the benefits of SHJ may be limited.</p></li></ul></li><li><p><strong>Fallback and Spilling:</strong></p><ul><li><p>If the hash table grows too large, Spark may attempt to spill data to disk. However, disk spilling is less efficient and can severely impact performance.</p></li></ul></li></ul><div><hr></div><h2>7. SHJ Compared to Other Join Strategies</h2><p>A clear comparison can help decide when to use SHJ over other join methods:</p><table><thead><tr><th>Aspect</th><th>Broadcast Hash Join (BHJ)</th><th>Sort Merge Join (SMJ)</th><th>Shuffle Hash Join (SHJ)</th></tr></thead><tbody><tr><td><strong>When to Use</strong></td><td>Very small tables (typically &lt;10 MB by default)</td><td>Large tables where sorting is tolerable</td><td>Moderately small build side that cannot be broadcast; avoid sorting overhead</td></tr><tr><td><strong>Sorting Requirement</strong></td><td>No sorting; smaller dataset is broadcasted</td><td>Sorting required across partitions</td><td>No sorting within partitions; uses in-memory hash table</td></tr><tr><td><strong>Memory Impact</strong></td><td>Minimal memory impact on executors</td><td>Uses more CPU for sorting</td><td>Requires sufficient memory per partition for hash tables</td></tr><tr><td><strong>Network Cost</strong></td><td>Minimal network I/O (broadcast eliminates shuffle)</td><td>High network I/O due to data shuffling</td><td>Same network cost as SMJ</td></tr></tbody></table><p><strong>Key Takeaways:</strong></p><ul><li><p><strong>BHJ</strong> is best when the smaller table is extremely small.</p></li><li><p><strong>SMJ</strong> is a general-purpose join that is robust for large datasets.</p></li><li><p><strong>SHJ</strong> strikes a balance by avoiding the heavy sorting cost when the per-partition memory size is manageable.</p></li></ul><h4><em>Shuffle hash join over sort-merge join</em></h4><p>In most cases Spark chooses sort-merge join (SMJ) when it can&#8217;t broadcast tables. Sort-merge join is typically the most expensive strategy because of the extra sort on both sides. Shuffle-hash join (SHJ) has been found to be faster in some circumstances (but not all) than sort-merge since it does not require an extra sorting step like SMJ. 
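</p><p>To make the phases concrete, the per-partition build-and-probe at the heart of SHJ can be modeled with plain Scala collections. This is a toy sketch only &#8211; the names (<code>Row</code>, <code>bigSide</code>, <code>smallSide</code>, the 4-partition setup) are illustrative and this is not Spark&#8217;s actual implementation:</p>

```scala
// Toy model of Shuffle Hash Join's three phases on plain Scala collections.
// All names here (Row, bigSide, smallSide, numPartitions) are illustrative.
case class Row(key: Int, value: String)

val numPartitions = 4
def partitionOf(key: Int): Int = math.floorMod(key.hashCode, numPartitions)

// Phase 1 "Shuffle": route each row to the partition its key hashes to.
def shuffle(rows: Seq[Row]): Map[Int, Seq[Row]] = rows.groupBy(r => partitionOf(r.key))

val bigSide   = Seq(Row(1, "order-a"), Row(2, "order-b"), Row(1, "order-c"), Row(3, "order-d"))
val smallSide = Seq(Row(1, "alice"), Row(2, "bob"))

val bigParts   = shuffle(bigSide)
val smallParts = shuffle(smallSide)

// Phases 2 and 3, per partition: build a hash table from the smaller side,
// then probe it with each row of the larger side (inner join semantics).
val joined: Seq[(Int, String, String)] = (0 until numPartitions).flatMap { p =>
  val buildTable = smallParts.getOrElse(p, Seq.empty).groupBy(_.key) // build
  bigParts.getOrElse(p, Seq.empty).flatMap { probe =>                // probe
    buildTable.getOrElse(probe.key, Seq.empty).map(b => (probe.key, probe.value, b.value))
  }
}
```

<p>Key 3 has no match on the build side, so it drops out (inner join semantics). The real engine does this per shuffle partition, which is exactly why each build-side partition must fit in memory.</p><p>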
There is a setting that lets you advise Spark that you prefer SHJ over SMJ; with it set, Spark will try to use SHJ instead of SMJ wherever possible. Note that this does not mean Spark will always choose SHJ over SMJ; you are simply stating a preference.</p><pre><code>SET spark.sql.join.preferSortMergeJoin = false</code></pre><p>Databricks&#8217; <a href="https://docs.databricks.com/runtime/photon.html">Photon</a> engine also replaces sort-merge join with shuffle hash join to boost query performance.</p><ul><li><p>You do not need to set <code>preferSortMergeJoin</code> to false for every job. For the first execution of a job, leave it at the default (true).</p></li><li><p>If the job performs many joins involving heavy data shuffling and struggles to meet its SLA, change <code>preferSortMergeJoin</code> to false.</p></li></ul><div><hr></div><h2>8. Configuration and Tuning Best Practices</h2><p>Optimizing SHJ involves careful configuration and continuous monitoring. Below are some best practices.</p><h3>A. Adaptive Query Execution (AQE)</h3><p><strong>What is AQE?</strong><br>Adaptive Query Execution dynamically adapts the physical plan based on runtime statistics. With Spark 3.x, AQE can convert a sort-merge join to a shuffle hash join if it detects that partition sizes are favorable.</p><p><strong>Configuration Example:</strong></p><pre><code>// Set the AQE threshold: if every post-shuffle partition is below 64 MB, Spark can use SHJ.
spark.conf.set("spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold", "64MB")</code></pre><p>This dynamic adjustment helps balance CPU use against memory load without manual intervention.</p><h3>B. 
Join Hints and Configurations</h3><p><strong>Explicit Hints:</strong><br>When you know the data characteristics, you can direct Spark to use SHJ via hints:</p><pre><code>// Use a hint to explicitly request a Shuffle Hash Join
val dfJoined = factTable.join(dimensionTable.hint("SHUFFLE_HASH"), "joinKey")
dfJoined.explain() // The physical plan should show ShuffledHashJoin</code></pre><p><strong>Disabling SMJ Preference:</strong><br>For cases where SHJ is preferred over SMJ, you can adjust the setting as follows:</p><pre><code>// Tell Spark to favor hash-based join strategies over sort-merge join.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")</code></pre><h3>C. Monitoring and Debugging</h3><p><strong>Using the Spark UI:</strong></p><ul><li><p><strong>Partition Metrics:</strong><br>Monitor the size and distribution of shuffle partitions to ensure they meet expected thresholds.</p></li><li><p><strong>Task Execution Details:</strong><br>Observe tasks&#8217; memory usage and CPU times. 
Unexpected OOM errors or high spill metrics may indicate misconfigured thresholds or skewed data.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WDjl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09f871dd-c444-4e2f-a392-5e0509d572ad_728x549.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WDjl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09f871dd-c444-4e2f-a392-5e0509d572ad_728x549.png 424w, https://substackcdn.com/image/fetch/$s_!WDjl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09f871dd-c444-4e2f-a392-5e0509d572ad_728x549.png 848w, https://substackcdn.com/image/fetch/$s_!WDjl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09f871dd-c444-4e2f-a392-5e0509d572ad_728x549.png 1272w, https://substackcdn.com/image/fetch/$s_!WDjl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09f871dd-c444-4e2f-a392-5e0509d572ad_728x549.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WDjl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09f871dd-c444-4e2f-a392-5e0509d572ad_728x549.png" width="728" height="549" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/09f871dd-c444-4e2f-a392-5e0509d572ad_728x549.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:549,&quot;width&quot;:728,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Shuffle Hash Join Spark Stages&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Shuffle Hash Join Spark Stages" title="Shuffle Hash Join Spark Stages" srcset="https://substackcdn.com/image/fetch/$s_!WDjl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09f871dd-c444-4e2f-a392-5e0509d572ad_728x549.png 424w, https://substackcdn.com/image/fetch/$s_!WDjl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09f871dd-c444-4e2f-a392-5e0509d572ad_728x549.png 848w, https://substackcdn.com/image/fetch/$s_!WDjl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09f871dd-c444-4e2f-a392-5e0509d572ad_728x549.png 1272w, https://substackcdn.com/image/fetch/$s_!WDjl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09f871dd-c444-4e2f-a392-5e0509d572ad_728x549.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Log Analysis:</strong></p><ul><li><p><strong>AQE Logs:</strong><br>When AQE is enabled, logs will show if the join strategy was dynamically switched.</p></li><li><p><strong>Executor Logs:</strong><br>Pay attention to memory allocation logs and warnings about data spills.</p></li></ul><div><hr></div><h2>9. Practical Example</h2><p>Let&#8217;s consider a real-world scenario to solidify our understanding. Suppose you are joining a large fact table with a moderately sized dimension table:</p><ul><li><p><strong>Fact Table:</strong> ~1 TB of transactional data.</p></li><li><p><strong>Dimension Table:</strong> ~5 GB of reference data.</p></li></ul><p><strong>Rationale:</strong><br>Broadcasting a 5 GB table is infeasible in this scenario, but if you partition the 5 GB table into 1000 slices, each partition is only about 5 MB. 
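</p><p>That back-of-the-envelope check is worth making explicit. A quick sanity check of the arithmetic (the 5 GB and 1000-partition figures are the scenario&#8217;s assumptions, not measured values):</p>

```scala
// Back-of-the-envelope: per-partition build size = build-side table size / shuffle partitions.
val buildSideBytes    = 5L * 1024 * 1024 * 1024 // ~5 GB dimension table (assumed)
val shufflePartitions = 1000                    // e.g. spark.sql.shuffle.partitions
val perPartitionMB    = buildSideBytes / shufflePartitions / (1024 * 1024)
// perPartitionMB is about 5, i.e. each build-side hash table stays tiny
```

<p>As long as this per-partition figure comfortably fits in executor memory (with headroom for skewed keys), the build side is a reasonable SHJ candidate.</p><p>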
This makes it an ideal candidate for a shuffle hash join.</p><p><strong>Implementation Example in Spark (Scala):</strong></p><pre><code>// Assuming factTable and dimensionTable are pre-defined DataFrames
val dfJoined = factTable.join(
  dimensionTable.hint("SHUFFLE_HASH"),
  Seq("joinKey") // Column(s) that define the join condition
)

// Explain the plan to verify the join strategy
dfJoined.explain(true)
// Expected outcome: the physical plan should display a "ShuffledHashJoin"
// operator, indicating that Spark is using SHJ for the join.</code></pre><p><strong>What to Look For:</strong></p><ul><li><p><strong>Physical Plan Inspection:</strong><br>Look for the <code>ShuffledHashJoin</code> operator in the explain plan output.</p></li><li><p><strong>Resource Usage:</strong><br>Monitor executor memory usage and check that each partition from the smaller dimension table fits within the allotted memory, avoiding spills or OOM errors.</p></li></ul><pre><code>ShuffledHashJoin [id1#3], [id2#8], Inner, BuildRight
:- Exchange hashpartitioning(id1#3, 200)
:  +- LocalTableScan [id1#3]
+- Exchange hashpartitioning(id2#8, 200)
   +- LocalTableScan [id2#8]</code></pre><h2><strong>10. Databricks platform-specific insights</strong></h2><p>Databricks generally relies on BHJ and SMJ under the hood, and uses SHJ in a more limited, adaptive way. Under AQE, Databricks might start a join as a sort-merge join but then <em>convert it to a shuffled hash join</em> at runtime if it finds that each partition&#8217;s size is below a threshold (and thus can fit in memory).</p><p>This is an optimization: Spark saves the cost of sorting when it realizes it wasn&#8217;t needed. By default, this conversion is off (threshold = 0) on vanilla Spark 3.2, but Databricks may enable it or allow setting it. If using hints, you can explicitly ask for an SHJ: e.g., <code>.hint("SHUFFLE_HASH")</code> in the DataFrame API or SQL hints. This can be useful if you <em>know</em> one side is moderately small but Spark&#8217;s stats are missing. Always ensure that the hint-targeted side will be small per partition; otherwise, you might get memory errors.</p><p>Databricks&#8217; strong skew mitigation helps SHJ as well &#8211; if one partition is skewed and would OOM an SHJ, AQE&#8217;s skew join handling could split that partition and even fall back to a sort-merge or a replicated join for that partition if necessary. Also, note that <strong>Photon</strong> (Databricks&#8217; vectorized engine) has an improved hashed join implementation that can spill gracefully and use multiple threads per join, which makes SHJ more viable for large data in Photon. In standard Spark, SHJ is single-threaded per task for the join itself (just like the SMJ merge is single-threaded per task).</p><h2>11. Conclusion</h2><p><strong>Shuffle Hash Join (SHJ)</strong> provides a balanced approach by eliminating the high cost of sorting that is present in Sort Merge Joins, while sidestepping the broadcast size limitations of Broadcast Hash Joins. 
By shuffling data to co-locate matching join keys and then using an in-memory hash table to perform the join, SHJ offers:</p><ul><li><p><strong>Improved CPU efficiency</strong> due to reduced sorting overhead.</p></li><li><p><strong>Scalability</strong> when the smaller dataset can be effectively partitioned.</p></li><li><p><strong>A flexible mechanism</strong> that can adapt to runtime data sizes through AQE.</p></li></ul><p>However, SHJ requires meticulous tuning and monitoring:</p><ul><li><p><strong>Memory Utilization:</strong><br>Ensure that each partition&#8217;s hash table fits in memory.</p></li><li><p><strong>Data Skew:</strong><br>Address uneven data distributions to prevent performance bottlenecks.</p></li><li><p><strong>Network Costs:</strong><br>Understand that while CPU usage may decrease, shuffling still incurs network overhead.</p></li></ul><p>By leveraging configuration settings, join hints, and adaptive query execution, data engineers can optimize their Spark workloads using SHJ. This detailed understanding equips you with the knowledge to carefully evaluate when SHJ is the right tool for your data joining needs, ensuring robust and efficient Spark application performance.</p><h2>Further Reading</h2><p>For more in-depth information and the latest updates on Spark join optimizations, the following resources are highly recommended:</p><ul><li><p><strong>Apache Spark Official Documentation &#8211; SQL Performance Tuning:</strong> Covers join strategy hints, adaptive execution, etc. 
(See <em>&#8220;Join Strategy Hints&#8221;</em> and <em>&#8220;Adaptive Query Execution&#8221;</em> in the Spark docs)&#8203;</p><p><a href="https://downloads.apache.org/spark/docs/3.1.2/sql-performance-tuning.html#:~:text=Join%20Strategy%20Hints%20for%20SQL,Queries">downloads.apache.org</a></p></li><li><p><a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/spark-tuning-glue-emr/using-join-hints-in-spark-sql.html#:~:text=,all%20executors%20within%20the%20cluster">Tuning Spark SQL queries for AWS Glue and Amazon EMR Spark jobs</a></p></li><li><p><strong>Apache Spark Official Documentation &#8211; Adaptive Query Execution (AQE):</strong> Detailed explanation of AQE features like converting SMJ to BHJ/SHJ and skew join handling&#8203;</p><p><a href="https://downloads.apache.org/spark/docs/3.1.2/sql-performance-tuning.html#:~:text=Converting%20sort,join">downloads.apache.org</a></p></li><li><p><strong>Databricks Documentation &#8211; Join Hints &amp; Optimizations:</strong> Databricks-specific docs on join strategies, including the <code>SKEW</code> and <code>RANGE</code> hints, and how AQE is used on Databricks&#8203;</p><p><a href="https://docs.databricks.com/gcp/en/transform/join#:~:text=Join%20hints%20on%20Databricks">docs.databricks.com</a></p></li><li><p><strong>&#8220;How Databricks Optimizes Spark SQL Joins&#8221; &#8211; Medium (dezimaldata):</strong> A blog post (Aug 2023) summarizing Databricks&#8217; techniques like CBO, AQE, range join and skew join optimizations&#8203;</p><p><a href="https://dezimaldata.medium.com/how-databricks-optimizes-the-spark-sql-joins-d01b95336d65#:~:text=In%20this%20blog%20post%2C%20we,joins%20and%20subqueries%2C%20such%20as">dezimaldata.medium.com</a></p></li><li><p><strong>Spark Summit Talks on Joins and AQE:</strong> Videos like <em>&#8220;Optimizing Shuffle Heavy Workloads&#8221;</em> or <em>&#8220;AQE in Spark 3.0&#8221;</em> (by Databricks engineers) for a deeper understanding of the internals of join execution 
and tuning.</p></li><li><p><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html#join-strategy-hints">https://spark.apache.org/docs/latest/sql-performance-tuning.html#join-strategy-hints</a></p></li><li><p><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution">https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution</a></p></li><li><p><a href="https://dezimaldata.medium.com/how-databricks-optimizes-the-spark-sql-joins-d01b95336d65">How Databricks Optimizes the Spark SQL Joins</a></p></li><li><p><a href="https://blogs.perficient.com/2025/03/28/top-5-mistakes-that-make-your-databricks-queries-slow-and-how-to-fix-them/">Top 5 Mistakes That Make Your Databricks Queries Slow (and How to Fix Them)</a></p><p>By consulting these materials, you can deepen your understanding of Spark join mechanisms and keep up to date with the evolving best practices on the Databricks platform.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Spark Join Strategies Explained: Sort Merge Join]]></title><description><![CDATA[Slow and Steady always wins the race]]></description><link>https://www.canadiandataguy.com/p/spark-join-strategies-explained-sort</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/spark-join-strategies-explained-sort</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Thu, 10 Apr 2025 05:22:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aee5d69-a69c-4f0a-bfc3-2a0a375c5a0b_686x386.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>What is it?</strong></h2><p>Sort-Merge Join is the <strong>default join strategy</strong> in Spark for large datasets that don&#8217;t qualify for a broadcast. 
It involves shuffling and sorting both sides of the join on the join key, then streaming through the sorted data to merge matching keys. SMJ is robust and scalable: it can handle very large tables and all join types (inner, outer, etc.), at the cost of more network and CPU usage.</p><h2><strong>How it works</strong></h2><p>Spark will use a Sort-Merge Join when neither side is small enough to broadcast (or if the join type is not supported by BHJ). The execution has three main phases:</p><ol><li><p><strong>Shuffle Phase:</strong> Both input datasets are repartitioned (shuffled) across the cluster nodes based on the join keys. This ensures that matching keys from both datasets reside within the same partitions on executors. The shuffle is an expensive network operation involving data redistribution across nodes; each executor receives and transmits data based on the key distribution. By default, Spark employs 200 partitions (<code>spark.sql.shuffle.partitions</code>). In the physical plan, this shows up as <code>Exchange hashpartitioning(...)</code> on each side of the join.</p></li><li><p><strong>Sort Phase:</strong> Within each partition, Spark sorts the records by the join key. Each side is sorted independently. The plan will have local <code>Sort</code> operators after the exchange on each side. The result is that in partition <em>i</em>, both datasets are sorted by key. Sorting is an expensive step (<strong>O(n log n)</strong> per partition). If the data is already partitioned and sorted (e.g. 
bucketing and sorting on the join key), Spark may skip the shuffle and/or sort &#8211; but this requires specific conditions (like both sides being bucketed by the join key with the same number of partitions).</p></li><li><p><strong>Merge Phase:</strong> Once each partition has sorted data from both sides, Spark performs a <strong>merge join</strong>: it iterates through the two sorted lists and finds matching keys, similar to how one would merge two sorted files. Because the data is sorted, Spark can do this by advancing a pointer through each list &#8211; linear time per partition, with no nested loops. The output of each task is the joined records for that partition&#8217;s key range.</p></li></ol><pre><code>== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
  +- SortMergeJoin [id#320L], [id#335L], Inner
       :- Sort [id#320L ASC NULLS FIRST], false, 0
       :  +- Exchange hashpartitioning(id#320L, 36), ENSURE_REQUIREMENTS, [id=#1018]
       :    +- Filter isnotnull(id#320L)
       :     +- Scan ExistingRDD[id#320L,col#321]
               +- Sort [id#335L ASC NULLS FIRST], false, 0
                +- Exchange hashpartitioning(id#335L, 36), ENSURE_REQUIREMENTS, [id=#1019]
                  +- Filter isnotnull(id#335L)
                   +- Scan ExistingRDD[id#335L,int_col#336L]</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XlqS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca8c93bc-df28-4fc8-b1a2-2018a611bd59_464x490.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XlqS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca8c93bc-df28-4fc8-b1a2-2018a611bd59_464x490.png 424w, https://substackcdn.com/image/fetch/$s_!XlqS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca8c93bc-df28-4fc8-b1a2-2018a611bd59_464x490.png 848w, https://substackcdn.com/image/fetch/$s_!XlqS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca8c93bc-df28-4fc8-b1a2-2018a611bd59_464x490.png 1272w, https://substackcdn.com/image/fetch/$s_!XlqS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca8c93bc-df28-4fc8-b1a2-2018a611bd59_464x490.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XlqS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca8c93bc-df28-4fc8-b1a2-2018a611bd59_464x490.png" width="464" height="490" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ca8c93bc-df28-4fc8-b1a2-2018a611bd59_464x490.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:490,&quot;width&quot;:464,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XlqS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca8c93bc-df28-4fc8-b1a2-2018a611bd59_464x490.png 424w, https://substackcdn.com/image/fetch/$s_!XlqS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca8c93bc-df28-4fc8-b1a2-2018a611bd59_464x490.png 848w, https://substackcdn.com/image/fetch/$s_!XlqS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca8c93bc-df28-4fc8-b1a2-2018a611bd59_464x490.png 1272w, https://substackcdn.com/image/fetch/$s_!XlqS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca8c93bc-df28-4fc8-b1a2-2018a611bd59_464x490.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>Execution details</strong></h2><p>Sort-Merge join will span multiple stages in the Spark DAG. Typically, you&#8217;ll have one stage (or stages) to produce the shuffle partitions for side A, another for side B, and then a final stage where the actual join (merge) happens. 
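</p><p>The merge step itself is just a two-pointer walk over two key-sorted sequences. Here is a simplified sketch in plain Scala (toy code under that assumption, not Spark&#8217;s <code>SortMergeJoinExec</code>):</p>

```scala
// Toy model of the merge phase: a two-pointer walk over two key-sorted sequences.
// Illustrative only; Spark's actual implementation handles spills, codegen, etc.
def mergeJoin(
    left: Vector[(Int, String)],   // sorted by key
    right: Vector[(Int, String)]   // sorted by key
): Vector[(Int, String, String)] = {
  val out = Vector.newBuilder[(Int, String, String)]
  var i = 0
  var j = 0
  while (i < left.length && j < right.length) {
    val lk = left(i)._1
    val rk = right(j)._1
    if (lk < rk) i += 1       // advance the side with the smaller key
    else if (lk > rk) j += 1
    else {
      // Equal keys: emit this left row against the whole matching right-side run.
      var jj = j
      while (jj < right.length && right(jj)._1 == lk) {
        out += ((lk, left(i)._2, right(jj)._2))
        jj += 1
      }
      i += 1                  // the next left row may match the same right run
    }
  }
  out.result()
}
```

<p>Because both inputs are sorted, the pointers only move forward, giving the linear per-partition cost described above; duplicate keys are handled by replaying the matching run on the right side.</p><p>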
In Spark UI&#8217;s DAG visualization, you might see something like: both tables read in earlier stages, then a stage where &#8220;Exchange -&gt; Sort -&gt; WholeStageCodegen -&gt; SortMergeJoin&#8221; occurs&#8203;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!viK8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aee5d69-a69c-4f0a-bfc3-2a0a375c5a0b_686x386.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!viK8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aee5d69-a69c-4f0a-bfc3-2a0a375c5a0b_686x386.jpeg 424w, https://substackcdn.com/image/fetch/$s_!viK8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aee5d69-a69c-4f0a-bfc3-2a0a375c5a0b_686x386.jpeg 848w, https://substackcdn.com/image/fetch/$s_!viK8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aee5d69-a69c-4f0a-bfc3-2a0a375c5a0b_686x386.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!viK8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aee5d69-a69c-4f0a-bfc3-2a0a375c5a0b_686x386.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!viK8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aee5d69-a69c-4f0a-bfc3-2a0a375c5a0b_686x386.jpeg" width="686" height="386" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6aee5d69-a69c-4f0a-bfc3-2a0a375c5a0b_686x386.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:386,&quot;width&quot;:686,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Spark SQL Bucketing at Facebook - Cheng Su (Facebook)&quot;,&quot;title&quot;:&quot;Spark SQL Bucketing at Facebook - Cheng Su (Facebook)&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Spark SQL Bucketing at Facebook - Cheng Su (Facebook)" title="Spark SQL Bucketing at Facebook - Cheng Su (Facebook)" srcset="https://substackcdn.com/image/fetch/$s_!viK8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aee5d69-a69c-4f0a-bfc3-2a0a375c5a0b_686x386.jpeg 424w, https://substackcdn.com/image/fetch/$s_!viK8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aee5d69-a69c-4f0a-bfc3-2a0a375c5a0b_686x386.jpeg 848w, https://substackcdn.com/image/fetch/$s_!viK8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aee5d69-a69c-4f0a-bfc3-2a0a375c5a0b_686x386.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!viK8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aee5d69-a69c-4f0a-bfc3-2a0a375c5a0b_686x386.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>A Spark DAG visualization of a Sort-Merge Join.</em> Both tables are read and then <strong>shuffled</strong> (Exchange) so that matching keys co-locate. Each partition then <strong>sorts</strong> its chunk of data on the join key and <strong>merges</strong> the two sorted streams to output joined rows. (Some upstream stages show as &#8220;skipped&#8221; because their output was cached for reuse in this example.)</p><p><strong>Supported join types:</strong> SMJ supports <strong>all join types</strong> for equality conditions &#8211; inner, left, right, full outer, semi, and anti. It&#8217;s the fallback for any join that can&#8217;t use a more specialized strategy. 
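</p><p>To see what the merge phase actually does, here is a minimal pure-Python sketch (an illustration only &#8211; not Spark&#8217;s implementation, which operates on binary row data): after the shuffle has co-located matching keys and each side of a partition has been sorted, two cursors walk the sorted lists and emit joined rows.</p>

```python
# Toy sketch of the SMJ merge phase (illustrative only, not Spark's code).
# Assumes both sides of one shuffle partition are already sorted by key.
def sort_merge_inner_join(left, right):
    """Inner-join two lists of (key, value) pairs, both sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # advance whichever side has the smaller key
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with
            # every equal key on the left (handles duplicate join keys).
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                for _, rv in right[j:j_end]:
                    out.append((lk, left[i][1], rv))
                i += 1
            j = j_end
    return out

left = sorted([(2, "a"), (1, "b"), (2, "c")])    # the "sort" step
right = sorted([(2, "X"), (3, "Y")])
print(sort_merge_inner_join(left, right))
# [(2, 'a', 'X'), (2, 'c', 'X')]
```

<p>Each partition performs this merge independently after the Exchange, which is why the merge step itself parallelizes well.</p><p>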
Non-equi joins (inequality conditions) are not handled by SMJ&#8217;s merge itself; for those, Spark typically falls back to a Broadcast Nested Loop Join (if one side is small) or a Cartesian product, so SMJ is in practice an equi-join strategy. If you have a full outer join, or both sides are huge, SMJ is usually the plan Spark will choose. (A full outer join could not be executed as a hash join at all in Spark 2.x, so SMJ was the only choice; Spark 3.1 added full outer support to the shuffled hash join, but SMJ is still the common pick.)</p><h2><em>Why is it the most stable join?</em></h2><p>Sort-Merge Join is <em>network and CPU intensive</em>. It performs a <strong>full shuffle of both datasets</strong> &#8211; network I/O proportional to the data size &#8211; plus a sort of each partition. Memory usage during the sort phase can be high; Spark uses an external sort that spills to disk if a partition&#8217;s data doesn&#8217;t fit in memory. Unlike SHJ, SMJ is not all-or-nothing in memory: if a task has more data than RAM, it writes sorted runs to disk and merges them (graceful degradation).</p><p>This is why SMJ is considered <strong>stable for large data</strong> &#8211; it won&#8217;t crash for memory reasons; at worst it will spill and slow down. Still, you want to avoid excessive spilling by tuning partition sizes (Databricks often sets the default shuffle partitions to a high number or uses AQE to auto-tune partition counts).</p><p>Because both sides are shuffled, SMJ is symmetric &#8211; large and small tables alike incur the shuffle cost. The algorithm doesn&#8217;t build big hash tables, so it can handle very large inputs (even beyond memory) as long as you accept the sorting cost. 
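</p><p>That graceful degradation is essentially an external merge sort. A toy pure-Python version (illustrative only; Spark&#8217;s external sorter works on binary data and spills to real files on disk) sorts fixed-size runs &#8211; standing in for spill files &#8211; and then streams a k-way merge across them:</p>

```python
import heapq

# Toy external sort (illustrative): pretend only `run_size` records fit in
# memory, so each run is sorted separately (Spark would spill these runs to
# disk) and the sorted runs are then merged lazily, one record at a time.
def external_sort(records, run_size):
    runs = [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]   # one run = one "spill"
    return list(heapq.merge(*runs))                      # streaming k-way merge

print(external_sort([5, 1, 4, 2, 8, 7, 3, 6], run_size=3))
# [1, 2, 3, 4, 5, 6, 7, 8]
```

<p>This is why SMJ degrades gracefully rather than failing: memory pressure turns into extra disk I/O, not an out-of-memory error.</p><p>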
On the plus side, SMJ&#8217;s streaming merge has <strong>low overhead per record</strong> once the data is sorted, and if the data arrives partially sorted or pre-partitioned, the cost can be well below the worst case.</p><h2><strong>Databricks-specific insights</strong></h2><p>Databricks Runtime enables <strong>Adaptive Query Execution (AQE)</strong> by default, which can optimize sort-merge joins in two major ways:</p><ol><li><p><strong>Dynamic partition coalescing</strong> &#8211; after the shuffle, if many partitions are small, Databricks can coalesce them to reduce task overhead</p></li><li><p><strong>Skew handling</strong> &#8211; if some partitions are extremely large (skewed), Databricks can split them into multiple tasks to avoid stragglers</p></li></ol><p>We will discuss skew handling separately, but the key point is that with AQE, SMJ is not as rigid as it once was. Databricks also collects detailed statistics to decide join strategies: if the optimizer has reliable size estimates (via cost-based optimization), it might avoid SMJ in favor of BHJ where appropriate. However, for truly large tables where neither side is small, SMJ will be chosen because it&#8217;s the most general and robust approach.</p><h3>Advanced Performance Tuning Strategies</h3><p>While Spark handles the heavy lifting, you can tune SMJ performance by managing the shuffle and sort behavior:</p><ul><li><p><strong>Partition sizing:</strong> Adjust <code>spark.sql.shuffle.partitions</code> so that each post-shuffle partition is a reasonable size (Databricks often aims for ~128 MB per partition as a balance between parallelism and overhead). Too few partitions (huge partitions) mean slow sorts and potential disk spills; too many (tiny partitions) mean excessive task scheduling overhead. 
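</p><p>As a back-of-the-envelope starting point (a sketch &#8211; the helper name and the 128 MB target are assumptions for illustration, not a Spark API), you can derive an initial <code>spark.sql.shuffle.partitions</code> value from the expected shuffle size:</p>

```python
import math

# Hypothetical helper (not a Spark API): pick a starting value for
# spark.sql.shuffle.partitions by dividing the expected shuffle size by a
# ~128 MB per-partition target. AQE can still coalesce/split afterwards.
def suggested_shuffle_partitions(total_shuffle_bytes, target_bytes=128 * 1024**2):
    return max(1, math.ceil(total_shuffle_bytes / target_bytes))

ONE_TB = 1024**4
print(suggested_shuffle_partitions(ONE_TB))      # 8192
print(suggested_shuffle_partitions(2 * ONE_TB))  # 16384, e.g. two 1 TB inputs
```

<p>Compare that with the default of 200 partitions: a 1 TB shuffle at 200 partitions means roughly 5 GB per partition, which all but guarantees spills.</p><p>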
AQE can auto-coalesce partitions that are smaller than <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code> (default 64 MB).</p><p><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html#:~:text=Converting%20sort,hash%20join">spark.apache.org</a></p></li><li><p><strong>Take advantage of sorting where possible:</strong> If your data is <strong>bucketed</strong> and sorted on the join keys (with both sides using the same number of buckets on the same join keys), Spark can perform the <strong>join without a shuffle</strong> (it still sorts each bucket if it isn&#8217;t already sorted, but avoids data movement). On Databricks, Delta Lake can maintain clustering (Z-ordering or sorting) on keys; while Spark does not automatically detect that sort order to skip the sort stage, clustered data can improve CPU cache efficiency during the merge.</p></li><li><p><strong>Push down filters and projections:</strong> Reduce data size <em>before</em> the join. SMJ&#8217;s cost is superlinear in data volume (because of the sort). If you can filter out unnecessary rows or columns (and thus shuffle less data), do it first. The Catalyst optimizer should push filters down, but be mindful when writing queries (e.g., filter as early as possible in the query plan). Dropping unused columns also means less data is carried through the shuffle.</p></li><li><p><strong>Monitor for skew:</strong> SMJ is particularly vulnerable to skewed keys: if one key accounts for a huge fraction of the data, one shuffle partition will be enormous and the merge task for that partition becomes a straggler. We&#8217;ll discuss skew mitigation separately; Databricks can automatically split skewed partitions via AQE. If you suspect skew, the Spark UI&#8217;s stage detail will show whether one task processed far more data than the others.</p></li></ul><h2><strong>When to use SMJ</strong></h2><p>Typically, you don&#8217;t <em>force</em> a sort-merge join; Spark will use it by default for large data. 
But you might choose to use an SMJ (or let Spark use it) in cases where both datasets are large and similar in size, or when you&#8217;re doing a full outer join (which BHJ can&#8217;t handle). If one side can be broadcast but you choose not to (perhaps due to risk of OOM or because it&#8217;s just borderline size), SMJ will handle it gracefully. SMJ is also the strategy that can cope with <em>lack of statistics</em>: if Spark isn&#8217;t sure of sizes, it errs on the side of SMJ because it won&#8217;t blow up memory. On Databricks, if you disable adaptive execution or broadcasting, you are essentially forcing SMJ for all joins.</p><h2>Common Pitfalls</h2><ul><li><p>Inadequate shuffle partition tuning, leading to excessive disk spills or overhead from numerous tiny partitions.</p></li><li><p>Failure to minimize shuffle volume by removing unnecessary columns.</p></li><li><p>Ignoring or inadequately handling data skew.</p></li><li><p>Misjudging broadcast opportunities by incorrectly assessing dataset size (rely on in-memory exchange size, not disk size).</p></li></ul><blockquote><p><strong>Pitfalls:</strong> The major downside of SMJ is performance degradation if not tuned. Mistakes include not accounting for data skew (leading to very slow tasks) and leaving the default shuffle partitions at 200 regardless of data scale. For instance, joining two 1 TB tables with 200 partitions would create ~5 GB partitions, likely causing massive spills; increasing partitions (or using AQE) would be necessary. Another common pitfall is forgetting that <em>all columns</em> of both sides are shuffled by default. Projecting out unneeded columns can make a huge difference in shuffle volume. 
Also, if you have multiple joins in a single query (e.g., joining 3&#8211;4 tables), Spark may form a multi-way join plan &#8211; consider breaking a very large join into steps, or using broadcasts for some legs, to avoid an overly expensive single SMJ over many inputs.</p></blockquote><h3>Conclusion</h3><p>Sort-Merge Join remains a foundational element in Spark&#8217;s join strategies. Understanding its detailed mechanics&#8212;the shuffle, sort, and merge phases&#8212;is what lets you reason about its costs. With careful tuning and vigilant analysis, SMJ can turn demanding Spark workloads into highly optimized, reliable operations. On Databricks, always keep AQE enabled for SMJ &#8211; it will automatically <strong>optimize partition counts and handle skew</strong>, making SMJ perform much better in practice than the static execution plans of the past.</p><h4>Further Resources</h4><ul><li><p><strong>Apache Spark Official Documentation &#8211; SQL Performance Tuning:</strong> Covers join strategy hints, adaptive execution, and more (see <em>&#8220;Join Strategy Hints&#8221;</em> and <em>&#8220;Adaptive Query Execution&#8221;</em> in the Spark docs)</p><p><a href="https://downloads.apache.org/spark/docs/3.1.2/sql-performance-tuning.html#:~:text=Join%20Strategy%20Hints%20for%20SQL,Queries">downloads.apache.org</a></p></li><li><p><a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/spark-tuning-glue-emr/using-join-hints-in-spark-sql.html#:~:text=,all%20executors%20within%20the%20cluster">Tuning Spark SQL queries for AWS Glue and Amazon EMR Spark jobs</a></p></li><li><p><strong>Apache Spark Official Documentation &#8211; Adaptive Query Execution (AQE):</strong> Detailed explanation of AQE features such as converting SMJ to BHJ/SHJ and skew join handling</p><p><a href="https://downloads.apache.org/spark/docs/3.1.2/sql-performance-tuning.html#:~:text=Converting%20sort,join">downloads.apache.org</a></p></li><li><p><strong>Databricks Documentation &#8211; Join Hints &amp; Optimizations:</strong> 
Databricks-specific docs on join strategies, including the <code>SKEW</code> and <code>RANGE</code> hints, and how AQE is used on Databricks</p><p><a href="https://docs.databricks.com/gcp/en/transform/join#:~:text=Join%20hints%20on%20Databricks">docs.databricks.com</a></p></li><li><p><strong>&#8220;How Databricks Optimizes Spark SQL Joins&#8221; &#8211; Medium (dezimaldata):</strong> A blog post (Aug 2023) summarizing Databricks&#8217; techniques like CBO, AQE, range join and skew join optimizations</p><p><a href="https://dezimaldata.medium.com/how-databricks-optimizes-the-spark-sql-joins-d01b95336d65#:~:text=In%20this%20blog%20post%2C%20we,joins%20and%20subqueries%2C%20such%20as">dezimaldata.medium.com</a></p></li><li><p><strong>Spark Summit Talks on Joins and AQE:</strong> Videos like <em>&#8220;Optimizing Shuffle Heavy Workloads&#8221;</em> or <em>&#8220;AQE in Spark 3.0&#8221;</em> (by Databricks engineers) for a deeper understanding of the internals of join execution and tuning.</p></li><li><p><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html#join-strategy-hints">https://spark.apache.org/docs/latest/sql-performance-tuning.html#join-strategy-hints</a></p></li><li><p><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution">https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution</a></p></li><li><p><a href="https://dezimaldata.medium.com/how-databricks-optimizes-the-spark-sql-joins-d01b95336d65">How Databricks Optimizes the Spark SQL Joins</a></p></li><li><p><a href="https://blogs.perficient.com/2025/03/28/top-5-mistakes-that-make-your-databricks-queries-slow-and-how-to-fix-them/">Top 5 Mistakes That Make Your Databricks Queries Slow (and How to Fix Them)</a></p></li></ul>]]></content:encoded></item></channel></rss>