Jun 6, 2025

Unlock comprehensive, practical solutions to conquer data skew in Apache Spark—step-by-step from basics to advanced strategies for perfectly balanced workloads and optimized job performance.

4 Comments

Jayasurya Pilli

Apr 21

Thank you for such a detailed explanation on solution options for handling data skew.

However, I have a question, probably a quick one...

Given that the Liquid Clustering feature is now also available in Databricks, should Liquid Clustering be considered the first and foremost recommended solution, even over AQE?

In other words, as a recommended approach, shouldn't Liquid Clustering be considered first, followed by AQE, then BroadcastHashJoin, then Salting?

Please correct me if I'm wrong in my understanding.


Canadian Data Guy

Apr 21

In data engineering, a universal rule could be: read the least, process the least, and process that least as efficiently as possible. In other words, avoid the most expensive operation in a distributed system, which is the shuffle.

What do you think: should one choose broadcast hash join, liquid clustering, or something else?


Jayasurya Pilli

Apr 21

I see what you mean. Thanks for the swift response.

Andrii Fadieiev

Jun 8, 2025

Great stuff, thanks for sharing!

Canadian Data Guy Unfiltered

A Deep Dive into Skewed Joins, GroupBy…