Discussion about this post

Thank you very much for the article. Very lucid explanation with detailed illustrations on such an advanced topic. Without those detailed illustrations, I would have found it difficult to visualize. Now, my understanding of it is crystal clear.

The timing of this article was perfect, as I was about to do some research of my own to get a clear understanding of Spark join strategies. Now I don't need to. Your post saved me time and hassle and, more importantly, gave me a clear understanding of the topic.

However, I have one question and one point I would like clarified:

Question: As I understand from this article, Broadcast Hash Join and Shuffle Hash Join support only INNER joins. Does that mean OUTER joins aren't supported? If so, I'm curious to know why.

Also, I would like to seek clarification on the following:

Based on the "How it works" section of the article, I understand that Sort Merge Join uses a sort phase followed by a merge phase, which is the main difference from Shuffle Hash Join. However, in the Sort Merge Join diagram itself, towards the middle, under the "strategy" bullet point, I see a mention of hash join. This is where I am a little confused. Are you suggesting that Sort Merge Join still uses a hash join too? Or was it simply a copy-and-paste error from the Shuffle Hash Join diagram?

Could you please clarify?

Regards,

Jayasurya Pilli

Data Engineer
