4 Comments
Jayasurya Pilli's avatar

Thank you for such a detailed explanation of the options for handling data skew.

However, I have a question, probably a quick one...

Now that Liquid Clustering is available in Databricks, should it be considered the first and foremost recommended solution, even ahead of AQE?

In other words, shouldn't the recommended order be Liquid Clustering first, followed by AQE, then BroadcastHashJoin, then Salting?

Please correct me if I'm wrong in my understanding.

Canadian Data Guy's avatar

In data engineering, a universal rule could be: read the least, process the least, and process that least as efficiently as possible. In other words, avoid the most expensive operation in a distributed system, which is the shuffle.

What do you think: should one choose a broadcast hash join, Liquid Clustering, or something else?

Jayasurya Pilli's avatar

I see what you mean. Thanks for the swift response.

Andrii Fadieiev's avatar

Great stuff, thanks for sharing!