Decode the Join: A Spark Data Engineer’s…

May 9, 2025

Understand when and why to use Broadcast, Shuffle, or Sort-Merge Joins in Spark— with clear visuals, real-world use cases, and strategy tips tailored for data engineers.

Read →

4 Comments

Jayasurya Pilli

May 11, 2025

Thank you very much for the article. Very lucid explanation with detailed illustrations on such an advanced topic. Without those detailed illustrations, I would have found it difficult to visualize. Now, my understanding of it is crystal clear.

The timing of this article was perfect, as I was going to be doing some research of my own on the internet to have a clear understanding of the Spark Join strategies. Now I don't need. Your post saved me time and hassle in this regard. More importantly the clarity in my understanding.

However, I have a question and a clarification to ask about:

Question: For Broadcast Hash Join and Shuffle Hash Join, as I understand from this article, only INNER join is supported. Does that mean, OUTER join isn't supported? If that's the case, just curious to know as to why?

Also, I would like to seek clarification on the below:

Based on the "How it works" section of the article, I understand, Sort Merge join uses Sort phase followed by Merge phase, which is the main difference when compared to Shuffle Hash Join. However on Sort Merge join diagram itself, towards the middle of the diagram, under the "strategy" bullet point, I see a mention of the hash join? This is where I have a little bit of confusion. Are you suggesting that Sort Merge join still uses hash join too? Or, was it simply a case of copy-and-paste error from the Shuffle Hash Join diagram?

Could you please clarify?

Regards,

Jayasurya Pilli

Data Engineer

Reply (2)

Harathi Pasam

May 12, 2025

Hi Jayasurya,

Thanks so much for reading the article and for catching those mistakes in the infographic – I’ve just pushed an updated version with the fixes.

To answer your questions,

1. Broadcast and Shuffle Hash Joins in Spark only implement equi‐join variants (inner, left-semi, left-anti, and the “one-sided” outer joins when you broadcast the opposite side). Full-outer isn’t supported there because you’d need to build and track hash tables on both sides (doubling memory and bookkeeping). Spark falls back to Sort-Merge Join for any full-outer work. Since the code snippets show inner, it might have given the impression that only inner joins were supported. In the summary table, I've mentioned all the supported join types for each joining strategy.

2. It was indeed a Ctrl+C and Ctrl+V error, I have corrected the strategy box to reflect the right one.

Hope that clears things up! Let me know if you have any more questions.

Reply (1)

Jayasurya Pilli

May 12, 2025

Thank you. Much appreciated.

Canadian Data Guy

May 11, 2025

I will let Harathi respond to your overall concern.

Looks like you might found more depth useful, we have 3 independent blogs on each type of join. Read here https://www.canadiandataguy.com/s/deep-dive

Canadian Data Guy Unfiltered

Decode the Join: A Spark Data Engineer’s…