4 Comments
User's avatar
Jayasurya Pilli's avatar

Thank you very much for the article. Very lucid explanation with detailed illustrations on such an advanced topic. Without those detailed illustrations, I would have found it difficult to visualize. Now, my understanding of it is crystal clear.

The timing of this article was perfect, as I was going to be doing some research of my own on the internet to have a clear understanding of the Spark Join strategies. Now I don't need. Your post saved me time and hassle in this regard. More importantly the clarity in my understanding.

However, I have a question and a clarification to ask about:

Question: For Broadcast Hash Join and Shuffle Hash Join, as I understand from this article, only INNER join is supported. Does that mean, OUTER join isn't supported? If that's the case, just curious to know as to why?

Also, I would like to seek clarification on the below:

Based on the "How it works" section of the article, I understand, Sort Merge join uses Sort phase followed by Merge phase, which is the main difference when compared to Shuffle Hash Join. However on Sort Merge join diagram itself, towards the middle of the diagram, under the "strategy" bullet point, I see a mention of the hash join? This is where I have a little bit of confusion. Are you suggesting that Sort Merge join still uses hash join too? Or, was it simply a case of copy-and-paste error from the Shuffle Hash Join diagram?

Could you please clarify?

Regards,

Jayasurya Pilli

Data Engineer

Expand full comment
Harathi Pasam's avatar

Hi Jayasurya,

Thanks so much for reading the article and for catching those mistakes in the infographic – I’ve just pushed an updated version with the fixes.

To answer your questions,

1. Broadcast and Shuffle Hash Joins in Spark only implement equi‐join variants (inner, left-semi, left-anti, and the “one-sided” outer joins when you broadcast the opposite side). Full-outer isn’t supported there because you’d need to build and track hash tables on both sides (doubling memory and bookkeeping). Spark falls back to Sort-Merge Join for any full-outer work. Since the code snippets show inner, it might have given the impression that only inner joins were supported. In the summary table, I've mentioned all the supported join types for each joining strategy.

2. It was indeed a Ctrl+C and Ctrl+V error, I have corrected the strategy box to reflect the right one.

Hope that clears things up! Let me know if you have any more questions.

Expand full comment
Jayasurya Pilli's avatar

Thank you. Much appreciated.

Expand full comment
Canadian Data Guy's avatar

I will let Harathi respond to your overall concern.

Looks like you might found more depth useful, we have 3 independent blogs on each type of join. Read here https://www.canadiandataguy.com/s/deep-dive

Expand full comment
ErrorError